Preface



This is a book about algorithms for performing arithmetic, and their imple- 
mentation on modern computers. We are concerned with software more than 
hardware — we do not cover computer architecture or the design of computer 
hardware since good books are already available on these topics. Instead we 
focus on algorithms for efficiently performing arithmetic operations such as 
addition, multiplication and division, and their connections to topics such as 
modular arithmetic, greatest common divisors, the Fast Fourier Transform 
(FFT), and the computation of special functions. 

The algorithms that we present are mainly intended for arbitrary-precision 
arithmetic. That is, they are not limited by the computer wordsize of 32 or 
64 bits, only by the memory and time available for the computation. We 
consider both integer and real (floating-point) computations. 

The book is divided into four main chapters, plus one short chapter
(essentially an appendix). Chapter 1 covers integer arithmetic. This has, of
course, been considered in many other books and papers. However, there
has been much recent progress, inspired in part by the application to public
key cryptography, so most of the published books are now partly out of date
or incomplete. Our aim is to present the latest developments in a concise
manner. At the same time, we provide a self-contained introduction for the
reader who is not an expert in the field.

Chapter 2 is concerned with modular arithmetic and the FFT, and their
applications to computer arithmetic. We consider different number represen- 
tations, fast algorithms for multiplication, division and exponentiation, and 
the use of the Chinese Remainder Theorem (CRT). 

Chapter 3 covers floating-point arithmetic. Our concern is with
high-precision floating-point arithmetic, implemented in software if the precision
provided by the hardware (typically IEEE standard 53-bit significand) is
inadequate. The algorithms described in this chapter focus on correct rounding,
extending the IEEE standard to arbitrary precision.

Chapter 4 deals with the computation, to arbitrary precision, of functions
such as sqrt, exp, ln, sin, cos, and more generally functions defined by power
series or continued fractions. Of course, the computation of special functions 
is a huge topic so we have had to be selective. In particular, we have con- 
centrated on methods that are efficient and suitable for arbitrary-precision 
computations. 

The last chapter contains pointers to implementations, useful web sites, 
mailing lists, and so on. Finally, at the end there is a one-page Summary of 
Complexities which should be a useful aide-memoire. 

The chapters are fairly self-contained, so it is possible to read them out 
of order. For example, Chapter 4 could be read before Chapters 1-3, and
Chapter 5 can be consulted at any time. Some topics, such as Newton's 
method, appear in different guises in several chapters. Cross-references are 
given where appropriate. 

For details that are omitted we give pointers in the Notes and References 
sections of each chapter, as well as in the bibliography. We have tried, as far 
as possible, to keep the main text uncluttered by footnotes and references, 
so most references are given in the Notes and References sections. 

The book is intended for anyone interested in the design and implemen- 
tation of efficient algorithms for computer arithmetic, and more generally 
efficient numerical algorithms. We did our best to present algorithms that 
are ready to implement in your favorite language, while keeping a high-level 
description and not getting too involved in low-level or machine-dependent 
details. An alphabetical list of algorithms can be found in the index. 

Although the book is not specifically intended as a textbook, it could be 
used in a graduate course in mathematics or computer science, and for this 
reason, as well as to cover topics that could not be discussed at length in the 
text, we have included exercises at the end of each chapter. The exercises 
vary considerably in difficulty, from easy to small research projects, but we 
have not attempted to assign them a numerical rating. For solutions to the 
exercises, please contact the authors. 

We welcome comments and corrections. Please send them to either of the 
authors. 

Richard Brent and Paul Zimmermann 
MCA@rpbrent.com
Paul.Zimmermann@inria.fr

Canberra and Nancy, February 2010

Notation

$\mathbb{C}$             set of complex numbers
$\widehat{\mathbb{C}}$   set of extended complex numbers $\mathbb{C} \cup \{\infty\}$
$\mathbb{N}$             set of natural numbers (nonnegative integers)
$\mathbb{N}^*$           set of positive integers $\mathbb{N} \setminus \{0\}$
$\mathbb{Q}$             set of rational numbers
$\mathbb{R}$             set of real numbers
$\mathbb{Z}$             set of integers
$\mathbb{Z}/n\mathbb{Z}$ ring of residues modulo $n$
$C^n$                    set of (real or complex) functions with $n$ continuous derivatives
                         in the region of interest

$\Re(z)$                 real part of a complex number $z$
$\Im(z)$                 imaginary part of a complex number $z$
$\bar{z}$                conjugate of a complex number $z$
$|z|$                    Euclidean norm of a complex number $z$,
                         or absolute value of a scalar $z$

$B_n$                    Bernoulli numbers, $\sum_{n \ge 0} B_n z^n/n! = z/(e^z - 1)$
$C_n$                    scaled Bernoulli numbers, $C_n = B_{2n}/(2n)!$, $\sum C_n z^{2n} = (z/2)/\tanh(z/2)$
$T_n$                    tangent numbers, $\sum T_n z^{2n-1}/(2n-1)! = \tan z$
$H_n$                    harmonic number $\sum_{j=1}^{n} 1/j$ (0 if $n \le 0$)
$\binom{n}{k}$           binomial coefficient "$n$ choose $k$" $= n!/(k!\,(n-k)!)$ (0 if $k < 0$ or $k > n$)

$\beta$                  "word" base (usually $2^{32}$ or $2^{64}$) or "radix" (floating-point)
$n$                      "precision": number of base $\beta$ digits in an integer or in a
                         floating-point significand, or a free variable
$\varepsilon$            "machine precision" $\beta^{1-n}/2$ or (in complexity bounds)
                         an arbitrarily small positive constant
$\eta$                   smallest positive subnormal number

$\circ(x)$, $\circ_n(x)$ rounding of real number $x$ in precision $n$ (Definition 3.1.1)
ulp($x$)                 for a floating-point number $x$, one unit in the last place

$M(n)$                   time to multiply $n$-bit integers, or polynomials of
                         degree $n - 1$, depending on the context
$\sim M(n)$              a function $f(n)$ such that $f(n)/M(n) \to 1$ as $n \to \infty$
                         (we sometimes lazily omit the "$\sim$" if the meaning is clear)
$M(m, n)$                time to multiply an $m$-bit integer by an $n$-bit integer
$D(n)$                   time to divide a $2n$-bit integer by an $n$-bit integer,
                         giving quotient and remainder
$D(m, n)$                time to divide an $m$-bit integer by an $n$-bit integer,
                         giving quotient and remainder

$a | b$                  $a$ is a divisor of $b$, that is $b = ka$ for some $k \in \mathbb{Z}$
$a = b \bmod m$          modular equality, $m | (a - b)$
$q \leftarrow a$ div $b$ assignment of integer quotient to $q$ ($0 \le a - qb < b$)
$r \leftarrow a$ mod $b$ assignment of integer remainder to $r$ ($0 \le r = a - qb < b$)
$(a, b)$                 greatest common divisor of $a$ and $b$
$\left(\frac{a}{b}\right)$ or $(a|b)$   Jacobi symbol ($b$ odd and positive)
iff                      if and only if
$i \wedge j$             bitwise and of integers $i$ and $j$,
                         or logical and of two Boolean expressions
$i \vee j$               bitwise or of integers $i$ and $j$,
                         or logical or of two Boolean expressions
$i \oplus j$             bitwise exclusive-or of integers $i$ and $j$
$i \ll k$                integer $i$ multiplied by $2^k$
$i \gg k$                quotient of division of integer $i$ by $2^k$
$a \cdot b$, $a \times b$   product of scalars $a$, $b$
$a \star b$              cyclic convolution of vectors $a$, $b$
$\nu(n)$                 2-valuation: largest $k$ such that $2^k$ divides $n$ ($\nu(0) = \infty$)
$\sigma(e)$              length of the shortest addition chain to compute $e$
$\phi(n)$                Euler's totient function, $\#\{m \in \mathbb{N} : 0 < m \le n \wedge (m, n) = 1\}$

deg($A$)                 for a polynomial $A$, the degree of $A$
ord($A$)                 for a power series $A = \sum_j a_j z^j$,
                         ord($A$) $= \min\{j : a_j \ne 0\}$ (ord(0) $= +\infty$)
exp($x$) or $e^x$        exponential function
ln($x$)                  natural logarithm
$\log_b(x)$              base-$b$ logarithm $\ln(x)/\ln(b)$
lg($x$)                  base-2 logarithm $\ln(x)/\ln(2) = \log_2(x)$
log($x$)                 logarithm to any fixed base
$\log^k(x)$              $(\log x)^k$
$\lceil x \rceil$        ceiling function, $\min\{n \in \mathbb{Z} : n \ge x\}$
$\lfloor x \rfloor$      floor function, $\max\{n \in \mathbb{Z} : n \le x\}$
$\lfloor x \rceil$       nearest integer function, $\lfloor x + 1/2 \rfloor$
sign($n$)                $+1$ if $n > 0$, $-1$ if $n < 0$, and 0 if $n = 0$
nbits($n$)               $\lfloor \lg(n) \rfloor + 1$ if $n > 0$, 0 if $n = 0$

$[a, b]$                 closed interval $\{x \in \mathbb{R} : a \le x \le b\}$ (empty if $a > b$)
$(a, b)$                 open interval $\{x \in \mathbb{R} : a < x < b\}$ (empty if $a \ge b$)
$[a, b)$, $(a, b]$       half-open intervals, $a \le x < b$, $a < x \le b$ respectively

${}^t[a, b]$ or $[a, b]^t$   column vector $\begin{pmatrix} a \\ b \end{pmatrix}$
$[a, b; c, d]$           2 × 2 matrix $\begin{pmatrix} a & b \\ c & d \end{pmatrix}$

$\hat{a}_j$              element of the (forward) Fourier transform of vector $a$
$\check{a}_j$            element of the backward Fourier transform of vector $a$

$f(n) = O(g(n))$         $\exists c, n_0$ such that $|f(n)| \le c\,g(n)$ for all $n \ge n_0$
$f(n) = \Omega(g(n))$    $\exists c > 0, n_0$ such that $|f(n)| \ge c\,g(n)$ for all $n \ge n_0$
$f(n) = \Theta(g(n))$    $f(n) = O(g(n))$ and $g(n) = O(f(n))$
$f(n) \sim g(n)$         $f(n)/g(n) \to 1$ as $n \to \infty$
$f(n) = o(g(n))$         $f(n)/g(n) \to 0$ as $n \to \infty$
$f(n) \ll g(n)$          $f(n) = O(g(n))$
$f(n) \gg g(n)$          $g(n) \ll f(n)$
$f(x) \sim \sum_0^n a_j/x^j$   $f(x) - \sum_0^n a_j/x^j = o(1/x^n)$ as $x \to +\infty$

123 456 789              123456789 (for large integers, we may use a space after
                         every third digit)
xxx.yyy$_\rho$           a number xxx.yyy written in base $\rho$;
                         for example, the decimal number 3.25 is $11.01_2$ in binary
$\frac{a}{b+}\,\frac{c}{d+}\,\frac{e}{f+} \cdots$   continued fraction $a/(b + c/(d + e/(f + \cdots)))$
$|A|$                    determinant of a matrix $A$, e.g. $\begin{vmatrix} a & b \\ c & d \end{vmatrix} = ad - bc$
PV $\int_a^b f(x)\,dx$   Cauchy principal value integral, defined by a limit
                         if $f$ has a singularity in $(a, b)$
$s \,\|\, t$             concatenation of strings $s$ and $t$
$\triangleright$ <text>  comment in an algorithm
$\Box$                   end of a proof



Chapter 1 
Integer Arithmetic 



In this chapter our main topic is integer arithmetic. However, 
we shall see that many algorithms for polynomial arithmetic are 
similar to the corresponding algorithms for integer arithmetic, 
but simpler due to the lack of carries in polynomial arithmetic. 
Consider for example addition: the sum of two polynomials of 
degree n always has degree at most n, whereas the sum of two
n-digit integers may have n + 1 digits. Thus we often describe 
algorithms for polynomials as an aid to understanding the corre- 
sponding algorithms for integers. 

1.1 Representation and Notations 

We consider in this chapter algorithms working on integers. We distinguish 
between the logical — or mathematical — representation of an integer, and 
its physical representation on a computer. Our algorithms are intended for 
"large" integers — they are not restricted to integers that can be represented 
in a single computer word. 

Several physical representations are possible. We consider here only the
most common one, namely a dense representation in a fixed base. Choose
an integral base $\beta > 1$. (In case of ambiguity, $\beta$ will be called the internal
base.) A positive integer $A$ is represented by the length $n$ and the digits $a_i$
of its base $\beta$ expansion:

$A = a_{n-1}\beta^{n-1} + \cdots + a_1\beta + a_0,$

where $0 \le a_i \le \beta - 1$, and $a_{n-1}$ is sometimes assumed to be non-zero.
Since the base $\beta$ is usually fixed in a given program, only the length $n$
and the integers $(a_i)_{0 \le i < n}$ need to be stored. Some common choices for $\beta$
are $2^{32}$ on a 32-bit computer, or $2^{64}$ on a 64-bit machine; other possible
choices are respectively $10^9$ and $10^{19}$ for a decimal representation, or $2^{53}$
when using double-precision floating-point registers. Most algorithms given
in this chapter work in any base; the exceptions are explicitly mentioned.

We assume that the sign is stored separately from the absolute value. 
This is known as the "sign-magnitude" representation. Zero is an important 
special case; to simplify the algorithms we assume that $n = 0$ if $A = 0$, and
we usually assume that this case is treated separately. 
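
As a small illustration of this dense representation, the sketch below converts between a Python integer and its digit list in base $\beta$. The names `to_digits` and `from_digits` are hypothetical helpers introduced here, and the little-endian ordering (index $i$ holds $a_i$) is an assumption of the sketch.

```python
def to_digits(A, beta):
    """Return the base-beta digits [a_0, a_1, ..., a_{n-1}] of A >= 0.

    The length n of the list is 0 when A == 0, matching the convention above.
    """
    digits = []
    while A > 0:
        A, d = divmod(A, beta)
        digits.append(d)
    return digits


def from_digits(digits, beta):
    """Inverse of to_digits: evaluate sum(a_i * beta**i) by Horner's rule."""
    A = 0
    for d in reversed(digits):
        A = A * beta + d
    return A
```

With $\beta = 2^{32}$, for instance, the integer $2^{100}$ occupies four words.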

Except when explicitly mentioned, we assume that all operations are off- 
line, i.e., all inputs (resp. outputs) are completely known at the beginning 
(resp. end) of the algorithm. Different models include lazy and relaxed
algorithms, and are discussed in the Notes and References (§1.9).

1.2 Addition and Subtraction 

As an explanatory example, here is an algorithm for integer addition. In the 
algorithm, d is a carry bit. 

Our algorithms are given in a language which mixes mathematical nota- 
tion and syntax similar to that found in many high-level computer languages. 
It should be straightforward to translate into a language such as C. Note that
":=" indicates a definition, and "←" indicates assignment. Line numbers are
included if we need to refer to individual lines in the description or analysis 
of the algorithm. 

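Algorithm 1.1 Integer Addition

Input: $A = \sum_0^{n-1} a_i\beta^i$, $B = \sum_0^{n-1} b_i\beta^i$, carry-in $0 \le d_{in} \le 1$
Output: $C := \sum_0^{n-1} c_i\beta^i$ and $0 \le d \le 1$ such that $A + B + d_{in} = d\beta^n + C$
1: $d \leftarrow d_{in}$
2: for $i$ from 0 to $n - 1$ do
3:     $s \leftarrow a_i + b_i + d$
4:     $(d, c_i) \leftarrow (s$ div $\beta$, $s$ mod $\beta)$
5: return $C$, $d$.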
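
A direct transcription of Algorithm 1.1 might look as follows. This is a minimal sketch, assuming the operands are little-endian digit lists of equal length; the function name `integer_addition` is chosen here for illustration.

```python
def integer_addition(a, b, beta, d_in=0):
    """Add two n-digit numbers in base beta (sketch of Algorithm 1.1).

    a, b are digit lists with a[i] the coefficient of beta**i.
    Returns (c, d) with A + B + d_in == d * beta**n + C.
    """
    assert len(a) == len(b) and d_in in (0, 1)
    c = []
    d = d_in                      # step 1
    for ai, bi in zip(a, b):      # step 2
        s = ai + bi + d           # step 3
        d, ci = divmod(s, beta)   # step 4: (s div beta, s mod beta)
        c.append(ci)
    return c, d                   # step 5
```

For example, adding 999 and 1 in base 10 yields three zero digits and a carry-out of 1.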



Let $T$ be the number of different values taken by the data type representing
the coefficients $a_i$, $b_i$. (Clearly $\beta \le T$ but equality does not necessarily
hold, e.g., $\beta = 10^9$ and $T = 2^{32}$.) At step 3, the value of $s$ can be as large
as $2\beta - 1$, which is not representable if $\beta = T$. Several workarounds are
possible: either use a machine instruction that gives the possible carry of
$a_i + b_i$; or use the fact that, if a carry occurs in $a_i + b_i$, then the computed
sum (if performed modulo $T$) equals $t := a_i + b_i - T < a_i$; thus comparing
$t$ and $a_i$ will determine if a carry occurred. A third solution is to keep a bit
in reserve, taking $\beta \le T/2$.
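
The comparison trick can be simulated in Python by reducing every sum modulo $T$ explicitly. The sketch below assumes $\beta = T = 2^{32}$; the function `add_mod_T` is a hypothetical helper, and the second comparison (for the carry-in) is an extension of the argument in the text, not something the text spells out.

```python
T = 2**32  # assumed number of values of a machine word; here beta = T


def add_mod_T(ai, bi, d=0):
    """Return (digit, carry) for ai + bi + d using only arithmetic modulo T."""
    t = (ai + bi) % T       # what a T-valued machine word would hold
    carry = int(t < ai)     # the sum wrapped around iff t < ai
    u = (t + d) % T         # now add the carry-in d (0 or 1)
    carry |= int(u < t)     # this second addition may wrap as well
    return u, carry
```

At most one of the two additions can wrap, since $a_i + b_i + d \le 2T - 1$, so combining the two flags with a bitwise or is safe.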

The subtraction code is very similar. Step 3 simply becomes $s \leftarrow a_i - b_i + d$,
where $d \in \{-1, 0\}$ is the borrow of the subtraction, and $-\beta \le s < \beta$.
The other steps are unchanged, with the invariant $A - B + d_{in} = d\beta^n + C$.
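
The same loop with the modified step 3 can be sketched as below. It relies on Python's `divmod` rounding the quotient towards minus infinity, which is the behaviour "div" and "mod" need here so that $d \in \{-1, 0\}$; in a language whose division truncates towards zero (such as C) the borrow would have to be handled explicitly. The function name is hypothetical.

```python
def integer_subtraction(a, b, beta, d_in=0):
    """Subtract digit lists in base beta; returns (c, d) with
    A - B + d_in == d * beta**n + C and d in {-1, 0}."""
    assert len(a) == len(b) and d_in in (-1, 0)
    c = []
    d = d_in
    for ai, bi in zip(a, b):
        s = ai - bi + d           # -beta <= s < beta
        d, ci = divmod(s, beta)   # floor division: d is -1 when s < 0
        c.append(ci)
    return c, d
```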

We use the arithmetic complexity model, where cost is measured by the 
number of machine instructions performed, or equivalently (up to a constant 
factor) the time on a single processor. 

Addition and subtraction of n-word integers cost $O(n)$, which is negligible
compared to the multiplication cost. However, it is worth trying to reduce
the constant factor implicit in this $O(n)$ cost. We shall see in §1.3 that
"fast" multiplication algorithms are obtained by replacing multiplications by 
additions (usually more additions than the multiplications that they replace). 
Thus, the faster the additions are, the smaller will be the thresholds for 
changing over to the "fast" algorithms. 



1.3 Multiplication 

A nice application of large integer multiplication is the Kronecker-Schönhage
trick, also called segmentation or substitution by some authors. Assume
we want to multiply two polynomials $A(x)$ and $B(x)$ with non-negative
integer coefficients (see Exercise 1.1 for negative coefficients). Assume both
polynomials have degree less than $n$, and coefficients are bounded by $\rho$.
Now take a power $X = \beta^k > n\rho^2$ of the base $\beta$, and multiply the
integers $a = A(X)$ and $b = B(X)$ obtained by evaluating $A$ and $B$ at $x = X$.
If $C(x) = A(x)B(x) = \sum c_i x^i$, we clearly have $C(X) = \sum c_i X^i$. Now since
the $c_i$ are bounded by $n\rho^2 < X$, the coefficients $c_i$ can be retrieved by simply
"reading" blocks of $k$ words in $C(X)$. Assume for example that we want to
compute

$(6x^5 + 6x^4 + 4x^3 + 9x^2 + x + 3)(7x^4 + x^3 + 2x^2 + x + 7),$

with degree less than $n = 6$, and coefficients bounded by $\rho = 9$. We can take
$X = 10^3 > n\rho^2$, and perform the integer multiplication:

6 006 004 009 001 003 × 7 001 002 001 007
= 42 048 046 085 072 086 042 070 010 021,

from which we can read off the product

$42x^9 + 48x^8 + 46x^7 + 85x^6 + 72x^5 + 86x^4 + 42x^3 + 70x^2 + 10x + 21.$
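
The trick can be tried out directly with Python's built-in big integers. The following is only a sketch under stated assumptions: coefficients are non-negative and stored little-endian, and the function name `kronecker_multiply` and the choice of the smallest suitable $k$ are illustrative, not prescribed by the text.

```python
def kronecker_multiply(a, b, beta=10):
    """Multiply polynomials with non-negative integer coefficients
    (little-endian lists) through one big-integer product."""
    n = max(len(a), len(b))            # both degrees are less than n
    rho = max(max(a), max(b))          # bound on the coefficients
    k = 1
    while beta**k <= n * rho * rho:    # smallest k with beta**k > n*rho^2
        k += 1
    X = beta**k
    A = sum(ai * X**i for i, ai in enumerate(a))   # a = A(X)
    B = sum(bj * X**j for j, bj in enumerate(b))   # b = B(X)
    C = A * B
    c = []
    for _ in range(len(a) + len(b) - 1):           # read blocks of k digits
        C, ci = divmod(C, X)
        c.append(ci)
    return c
```

Called on the two polynomials of the example, with coefficient lists `[3, 1, 9, 4, 6, 6]` and `[7, 1, 2, 1, 7]`, it selects $X = 10^3$ and returns the ten coefficients read off above, from the constant term 21 up to the leading coefficient 42.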

Conversely, suppose we want to multiply two integers $a = \sum_{0 \le i < n} a_i\beta^i$
and $b = \sum_{0 \le j < n} b_j\beta^j$. Multiply the polynomials $A(x) = \sum_{0 \le i < n} a_i x^i$ and
$B(x) = \sum_{0 \le j < n} b_j x^j$, obtaining a polynomial $C(x)$, then evaluate $C(x)$ at
$x = \beta$ to obtain $ab$. Note that the coefficients of $C(x)$ may be larger than $\beta$,
in fact they may be up to about $n\beta^2$. For example, with $a = 123$, $b = 456$,
and $\beta = 10$, we obtain $A(x) = x^2 + 2x + 3$, $B(x) = 4x^2 + 5x + 6$, with product
$C(x) = 4x^4 + 13x^3 + 28x^2 + 27x + 18$, and $C(10) = 56088$. These examples
demonstrate the analogy between operations on polynomials and integers, 
and also show the limits of the analogy. 
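
The converse direction can be sketched in the same style: split both integers into digits, multiply the digit polynomials without any carry propagation, and only then evaluate at $x = \beta$. The helper names below are hypothetical.

```python
def digits(a, beta):
    """Little-endian base-beta digits of a >= 0 (a single 0 digit for a == 0)."""
    out = []
    while a > 0:
        a, d = divmod(a, beta)
        out.append(d)
    return out or [0]


def multiply_via_polynomials(a, b, beta=10):
    """Return (C, C(beta)): the coefficient list of A(x)B(x) and its value."""
    A, B = digits(a, beta), digits(b, beta)
    C = [0] * (len(A) + len(B) - 1)
    for i, ai in enumerate(A):
        for j, bj in enumerate(B):
            C[i + j] += ai * bj        # no carries: coefficients may exceed beta
    return C, sum(c * beta**k for k, c in enumerate(C))
```

The coefficients of the returned list are not reduced modulo $\beta$, which is exactly where the analogy between polynomials and integers stops.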

A common and very useful notation is to let $M(n)$ denote the time to multiply
$n$-bit integers, or polynomials of degree $n - 1$, depending on the context.
In the polynomial case, we assume that the cost of multiplying coefficients is 
constant; this is known as the arithmetic complexity model, whereas the bit 
complexity model also takes into account the cost of multiplying coefficients, 
and thus their bit-size. 

1.3.1 Naive Multiplication 



Algorithm 1.2 BasecaseMultiply

Input: $A = \sum_0^{m-1} a_i\beta^i$, $B = \sum_0^{n-1} b_j\beta^j$
Output: $C = AB := \sum_0^{m+n-1} c_k\beta^k$
1: $C \leftarrow A \cdot b_0$
2: for $j$ from 1 to $n - 1$ do
3:     $C \leftarrow C + \beta^j(A \cdot b_j)$
4: return $C$.

Theorem 1.3.1 Algorithm BasecaseMultiply computes the product $AB$
correctly, and uses $\Theta(mn)$ word operations.

The multiplication by $\beta^j$ at step 3 is trivial with the chosen dense
representation: it simply requires shifting by $j$ words towards the most significant
words. The main operation in Algorithm BasecaseMultiply is the computation
of $A \cdot b_j$ and its accumulation into $C$ at step 3. Since all fast algorithms
rely on multiplication, the most important operation to optimize in
multiple-precision software is thus the multiplication of an array of $m$ words by one
word, with accumulation of the result in another array of $m + 1$ words.
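
The structure described above, a loop around one "multiply by a word and accumulate" primitive, can be sketched as follows. The names `addmul_1` and `basecase_multiply` and the little-endian list layout are assumptions of this sketch; a production library would implement the inner loop in assembly or with double-word arithmetic.

```python
def addmul_1(c, a, w, j, beta):
    """In place: c += beta**j * (A * w), where w is a single word.

    Assumes c[j + len(a)] is still zero, which holds inside basecase_multiply.
    """
    carry = 0
    for i, ai in enumerate(a):
        t = c[i + j] + ai * w + carry      # t <= beta**2 - 1
        carry, c[i + j] = divmod(t, beta)
    c[j + len(a)] += carry


def basecase_multiply(a, b, beta):
    """Schoolbook product of digit lists a (m words) and b (n words)."""
    m, n = len(a), len(b)
    c = [0] * (m + n)
    for j, bj in enumerate(b):             # steps 1-3, with the shift by j words
        addmul_1(c, a, bj, j, beta)
    return c
```

The shift by $\beta^j$ costs nothing here: it is just the index offset `j` passed to `addmul_1`.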

We sometimes call Algorithm BasecaseMultiply schoolbook multiplica- 
tion since it is close to the "long multiplication" algorithm that used to be 
taught at school. 

Since multiplication with accumulation usually makes extensive use of the 
pipeline, it is best to give it arrays that are as long as possible, which means 
that $A$ rather than $B$ should be the operand of larger size (i.e., $m \ge n$).

1.3.2 Karatsuba's Algorithm 

Karatsuba's algorithm is a "divide and conquer" algorithm for multiplication 
of integers (or polynomials). The idea is to reduce a multiplication of length 
n to three multiplications of length n/2, plus some overhead that costs $O(n)$.

In the following, $n_0 \ge 2$ denotes the threshold between naive multiplication
and Karatsuba's algorithm, which is used for $n_0$-word and larger inputs.
The optimal "Karatsuba threshold" $n_0$ can vary from about 10 to about 100
words, depending on the processor and on the relative cost of multiplication
and addition (see Exercise 1.6).

Algorithm 1.3 KaratsubaMultiply

Input: $A = \sum_0^{n-1} a_i\beta^i$, $B = \sum_0^{n-1} b_j\beta^j$
Output: $C = AB := \sum_0^{2n-1} c_k\beta^k$
if $n < n_0$ then return BasecaseMultiply($A$, $B$)
$k \leftarrow \lceil n/2 \rceil$
$(A_0, B_0) := (A, B)$ mod $\beta^k$, $(A_1, B_1) := (A, B)$ div $\beta^k$
$s_A \leftarrow$ sign$(A_0 - A_1)$, $s_B \leftarrow$ sign$(B_0 - B_1)$
$C_0 \leftarrow$ KaratsubaMultiply($A_0$, $B_0$)
$C_1 \leftarrow$ KaratsubaMultiply($A_1$, $B_1$)
$C_2 \leftarrow$ KaratsubaMultiply($|A_0 - A_1|$, $|B_0 - B_1|$)
return $C := C_0 + (C_0 + C_1 - s_A s_B C_2)\beta^k + C_1\beta^{2k}$.

Theorem 1.3.2 Algorithm KaratsubaMultiply computes the product $AB$
correctly, using $K(n) = O(n^\alpha)$ word multiplications, with $\alpha = \lg 3 \approx 1.585$.

Proof. Since $s_A|A_0 - A_1| = A_0 - A_1$ and $s_B|B_0 - B_1| = B_0 - B_1$, we
have $s_A s_B |A_0 - A_1||B_0 - B_1| = (A_0 - A_1)(B_0 - B_1)$, and thus
$C = A_0B_0 + (A_0B_1 + A_1B_0)\beta^k + A_1B_1\beta^{2k}$.

Since $A_0$, $B_0$, $|A_0 - A_1|$ and $|B_0 - B_1|$ have (at most) $\lceil n/2 \rceil$ words, and $A_1$
and $B_1$ have (at most) $\lfloor n/2 \rfloor$ words, the number $K(n)$ of word multiplications
satisfies the recurrence $K(n) = n^2$ for $n < n_0$, and $K(n) = 2K(\lceil n/2 \rceil) +
K(\lfloor n/2 \rfloor)$ for $n \ge n_0$. Assume $2^{\ell-1}n_0 < n \le 2^{\ell}n_0$ with $\ell \ge 1$. Then $K(n)$
is the sum of three $K(j)$ values with $j \le 2^{\ell-1}n_0$, so at most $3^{\ell}$ $K(j)$ with
$j \le n_0$. Thus $K(n) \le 3^{\ell}\max(K(n_0), (n_0 - 1)^2)$, which gives $K(n) \le Cn^\alpha$
with $C = 3^{1-\lg(n_0)}\max(K(n_0), (n_0 - 1)^2)$. □
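
The subtractive variant can be sketched on Python integers as follows. This is an illustration under stated assumptions rather than a tuned implementation: the built-in product stands in for BasecaseMultiply below the threshold, the default values of `beta` and `n0` are arbitrary, and `word_products` is a hypothetical helper that simply evaluates the recurrence for $K(n)$ used in the proof.

```python
def sign(x):
    return (x > 0) - (x < 0)


def karatsuba_multiply(A, B, n, beta=2**32, n0=4):
    """Product of 0 <= A, B < beta**n (sketch of Algorithm 1.3)."""
    assert n0 >= 2
    if n < n0:
        return A * B                       # stand-in for BasecaseMultiply
    k = (n + 1) // 2                       # ceil(n/2)
    X = beta**k
    A1, A0 = divmod(A, X)
    B1, B0 = divmod(B, X)
    sA, sB = sign(A0 - A1), sign(B0 - B1)
    C0 = karatsuba_multiply(A0, B0, k, beta, n0)
    C1 = karatsuba_multiply(A1, B1, n - k, beta, n0)
    C2 = karatsuba_multiply(abs(A0 - A1), abs(B0 - B1), k, beta, n0)
    return C0 + (C0 + C1 - sA * sB * C2) * X + C1 * X * X


def word_products(n, n0):
    """K(n) from the proof: n^2 below the threshold, else 2K(ceil)+K(floor)."""
    if n < n0:
        return n * n
    return 2 * word_products((n + 1) // 2, n0) + word_products(n // 2, n0)
```

For instance, with threshold $n_0 = 2$ the recurrence gives $K(3) = 2K(2) + K(1) = 7$ word products, against 9 for the schoolbook method.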

Different variants of Karatsuba's algorithm exist; the variant presented 
here is known as the subtractive version. Another classical one is the additive 
version, which uses $A_0 + A_1$ and $B_0 + B_1$ instead of $|A_0 - A_1|$ and $|B_0 - B_1|$.
However, the subtractive version is more convenient for integer arithmetic,
since it avoids the possible carries in $A_0 + A_1$ and $B_0 + B_1$, which require
either an extra word in these sums, or extra additions. 

The efficiency of an implementation of Karatsuba's algorithm depends 
heavily on memory usage. It is important to avoid allocating memory for 
the intermediate results $|A_0 - A_1|$, $|B_0 - B_1|$, $C_0$, $C_1$, and $C_2$ at each step
(although modern compilers are quite good at optimising code and removing
unnecessary memory references). One possible solution is to allow a large
temporary storage of $m$ words, used both for the intermediate results and
for the recursive calls. It can be shown that an auxiliary space of $m = 2n$
words, or even $m = O(\log n)$, is sufficient (see Exercises 1.7 and 1.8).

Since the product $C_2$ is used only once, it may be faster to have auxiliary
routines KaratsubaAddmul and KaratsubaSubmul that accumulate
their result, calling themselves recursively, together with
KaratsubaMultiply (see Exercise 1.10).

The version presented here uses $\sim 4n$ additions (or subtractions): $2 \times (n/2)$
to compute $|A_0 - A_1|$ and $|B_0 - B_1|$, then $n$ to add $C_0$ and $C_1$, again $n$ to
add or subtract $C_2$, and $n$ to add $(C_0 + C_1 - s_A s_B C_2)\beta^k$ to $C_0 + C_1\beta^{2k}$. An
improved scheme uses only $\sim 7n/2$ additions (see Exercise 1.9).

When considered as algorithms on polynomials, most fast multiplication
algorithms can be viewed as evaluation/interpolation algorithms. Karatsuba's
algorithm regards the inputs as polynomials $A_0 + A_1x$ and $B_0 + B_1x$
evaluated at $x = \beta^k$; since their product $C(x)$ is of degree 2, Lagrange's
interpolation theorem says that it is sufficient to evaluate $C(x)$ at three points.
The subtractive version evaluates$^1$ $C(x)$ at $x = 0, -1, \infty$, whereas the
additive version uses $x = 0, +1, \infty$.



1.3.3 Toom-Cook Multiplication 

Karatsuba's idea readily generalizes to what is known as Toom-Cook $r$-way
multiplication. Write the inputs as $a_0 + \cdots + a_{r-1}x^{r-1}$ and $b_0 + \cdots + b_{r-1}x^{r-1}$,
with $x = \beta^k$, and $k = \lceil n/r \rceil$. Since their product $C(x)$ is of degree $2r - 2$,
it suffices to evaluate it at $2r - 1$ distinct points to be able to recover $C(x)$,
and in particular $C(\beta^k)$. If $r$ is chosen optimally, Toom-Cook multiplication
of $n$-word numbers takes time $n^{1 + O(1/\sqrt{\log n})}$.

Most references, when describing subquadratic multiplication algorithms, 
only describe Karatsuba and FFT-based algorithms. Nevertheless, the Toom- 
Cook algorithm is quite interesting in practice. 

Toom-Cook $r$-way reduces one $n$-word product to $2r - 1$ products of
about $n/r$ words, thus costs $O(n^\nu)$ with $\nu = \log(2r - 1)/\log r$. However, the
constant hidden by the big-O notation depends strongly on the evaluation
and interpolation formulas, which in turn depend on the chosen points. One
possibility is to take $-(r - 1), \ldots, -1, 0, 1, \ldots, (r - 1)$ as evaluation points.
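
To get a feeling for how the exponent $\nu = \log(2r - 1)/\log r$ decreases with $r$, one can tabulate it for small $r$. The function name `toom_exponent` is introduced here for illustration only.

```python
import math


def toom_exponent(r):
    """Exponent nu such that Toom-Cook r-way costs O(n**nu)."""
    return math.log(2 * r - 1) / math.log(r)


for r in range(2, 6):
    print(r, round(toom_exponent(r), 3))
```

The value for $r = 2$ is $\lg 3 \approx 1.585$, the Karatsuba exponent, and for $r = 3$ it is $\log 5/\log 3 \approx 1.465$. The exponent says nothing about the hidden constant, which grows with $r$.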

The case $r = 2$ corresponds to Karatsuba's algorithm (§1.3.2). The
case $r = 3$ is known as Toom-Cook 3-way, sometimes simply called "the
Toom-Cook algorithm". Algorithm ToomCook3 uses evaluation points
$0, 1, -1, 2, \infty$, and tries to optimize the evaluation and interpolation formulae.

Algorithm 1.4 ToomCook3

Input: two integers $0 \le A, B < \beta^n$
Output: $AB := c_0 + c_1\beta^k + c_2\beta^{2k} + c_3\beta^{3k} + c_4\beta^{4k}$ with $k = \lceil n/3 \rceil$
Require: a threshold $n_1 \ge 3$
1: if $n < n_1$ then return KaratsubaMultiply($A$, $B$)
2: write $A = a_0 + a_1x + a_2x^2$, $B = b_0 + b_1x + b_2x^2$ with $x = \beta^k$.
3: $v_0 \leftarrow$ ToomCook3($a_0$, $b_0$)
4: $v_1 \leftarrow$ ToomCook3($a_{02} + a_1$, $b_{02} + b_1$) where $a_{02} \leftarrow a_0 + a_2$, $b_{02} \leftarrow b_0 + b_2$
5: $v_{-1} \leftarrow$ ToomCook3($a_{02} - a_1$, $b_{02} - b_1$)
6: $v_2 \leftarrow$ ToomCook3($a_0 + 2a_1 + 4a_2$, $b_0 + 2b_1 + 4b_2$)
7: $v_\infty \leftarrow$ ToomCook3($a_2$, $b_2$)
8: $t_1 \leftarrow (3v_0 + 2v_{-1} + v_2)/6 - 2v_\infty$, $t_2 \leftarrow (v_1 + v_{-1})/2$
9: $c_0 \leftarrow v_0$, $c_1 \leftarrow v_1 - t_1$, $c_2 \leftarrow t_2 - v_0 - v_\infty$, $c_3 \leftarrow t_1 - t_2$, $c_4 \leftarrow v_\infty$.

The divisions at step 8 are exact; if $\beta$ is a power of two, the division by
6 can be done using a division by 2 (which consists of a single shift)
followed by a division by 3 (see §1.4.7).

Toom-Cook $r$-way has to invert a $(2r - 1) \times (2r - 1)$ Vandermonde matrix
with parameters the evaluation points; if one chooses consecutive integer
points, the determinant of that matrix contains all primes up to $2r - 2$. This
proves that division by (a multiple of) 3 can not be avoided for Toom-Cook
3-way with consecutive integer points. See Exercise 1.14 for a generalization
of this result.

$^1$ Evaluating $C(x)$ at $\infty$ means computing the product $A_1B_1$ of the leading coefficients.
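
The evaluation and interpolation formulae of ToomCook3 can be checked with a short sketch on Python integers. Two departures from the algorithm as stated are assumptions of this sketch: the built-in product replaces KaratsubaMultiply below the threshold, and the operands evaluated at $1$, $-1$ and $2$ are passed with size $k + 1$ (they may slightly exceed $\beta^k$) and with an explicit sign, since the evaluation at $-1$ can be negative.

```python
def toom_cook3(A, B, n, beta=2**32, n1=3):
    """Return A*B for integers with |A|, |B| < beta**n."""
    assert n1 >= 3
    if A < 0 or B < 0:
        # Recurse on absolute values and restore the sign afterwards.
        s = -1 if (A < 0) != (B < 0) else 1
        return s * toom_cook3(abs(A), abs(B), n, beta, n1)
    if n < n1:
        return A * B                       # stand-in for KaratsubaMultiply
    k = -(-n // 3)                         # ceil(n/3)
    X = beta**k
    a0, a1, a2 = A % X, (A // X) % X, A // X**2
    b0, b1, b2 = B % X, (B // X) % X, B // X**2
    a02, b02 = a0 + a2, b0 + b2
    v0 = toom_cook3(a0, b0, k, beta, n1)
    v1 = toom_cook3(a02 + a1, b02 + b1, k + 1, beta, n1)
    vm1 = toom_cook3(a02 - a1, b02 - b1, k + 1, beta, n1)
    v2 = toom_cook3(a0 + 2*a1 + 4*a2, b0 + 2*b1 + 4*b2, k + 1, beta, n1)
    vinf = toom_cook3(a2, b2, k, beta, n1)
    t1 = (3*v0 + 2*vm1 + v2) // 6 - 2*vinf   # exact division by 6
    t2 = (v1 + vm1) // 2                      # exact division by 2
    c0, c1, c2, c3, c4 = v0, v1 - t1, t2 - v0 - vinf, t1 - t2, vinf
    return c0 + c1*X + c2*X**2 + c3*X**3 + c4*X**4
```

The recursion terminates because $k + 1 < n$ whenever $n \ge 3$.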



1.3.4 Use of the Fast Fourier Transform (FFT) 

Most subquadratic multiplication algorithms can be seen as evaluation-interpolation
algorithms. They mainly differ in the number of evaluation points,
and the values of those points. However, the evaluation and interpolation
formulae become intricate in Toom-Cook $r$-way for large $r$, since they involve
$O(r^2)$ scalar operations. The Fast Fourier Transform (FFT) is a way to
perform evaluation and interpolation efficiently for some special points (roots 
of unity) and special values of r. This explains why multiplication algorithms 
with the best known asymptotic complexity are based on the Fast Fourier 
transform. 

There are different flavours of FFT multiplication, depending on the ring 
where the operations are performed. The Schönhage-Strassen algorithm,
with a complexity of $O(n \log n \log\log n)$, works in the ring $\mathbb{Z}/(2^n + 1)\mathbb{Z}$. Since
it is based on modular computations, we describe it in Chapter 2.

Other commonly used algorithms work with floating-point complex num- 
bers. A drawback is that, due to the inexact nature of floating-point com- 
putations, a careful error analysis is required to guarantee the correctness of 
the implementation, assuming an underlying arithmetic with rigorous error 
bounds. See Theorem 3.3.2 in Chapter 3.

We say that multiplication is in the FFT range if $n$ is large and the
multiplication algorithm satisfies $M(2n) \sim 2M(n)$. For example, this is true
if the Schönhage-Strassen multiplication algorithm is used, but not if the
classical algorithm or Karatsuba's algorithm is used.

1.3.5 Unbalanced Multiplication 

The subquadratic algorithms considered so far (Karatsuba and Toom-Cook) 
work with equal-size operands. How do we efficiently multiply integers of 
different sizes with a subquadratic algorithm? This case is important in 
practice but is rarely considered in the literature. Assume the larger operand 
has size $m$, and the smaller has size $n \le m$, and denote by $M(m, n)$ the
corresponding multiplication cost. 

If evaluation-interpolation algorithms are used, the cost depends mainly
on the size of the result, that is $m + n$, so we have $M(m, n) \le M((m + n)/2)$,
at least approximately. We can do better than $M((m + n)/2)$ if $n$ is much
smaller than $m$, for example $M(m, 1) = O(m)$.

When $m$ is an exact multiple of $n$, say $m = kn$, a trivial strategy is to
cut the larger operand into $k$ pieces, giving $M(kn, n) = kM(n) + O(kn)$.
However, this is not always the best strategy, see Exercise 1.16.
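
The trivial strategy can be sketched as below, assuming Python integers as operands. The function name `multiply_by_pieces` and the `mul` parameter (any balanced $n$-word multiplication routine) are hypothetical; the additions inside the loop account for the $O(kn)$ term.

```python
def multiply_by_pieces(A, B, m, n, beta=2**32, mul=lambda x, y: x * y):
    """Multiply A < beta**m by B < beta**n (m >= n) with ceil(m/n) calls to mul.

    Returns (product, number_of_balanced_products).
    """
    X = beta**n
    k = -(-m // n)                     # number of n-word pieces of A
    result = 0
    for i in range(k):
        piece = (A // X**i) % X        # i-th piece of A, at most n words
        result += mul(piece, B) * X**i
    return result, k
```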

When m is not an exact multiple of n, several strategies are possible: 

• split the two operands into an equal number of pieces of unequal sizes; 

• or split the two operands into different numbers of pieces. 

Each strategy has advantages and disadvantages. We discuss each in turn. 

First Strategy: Equal Number of Pieces of Unequal Sizes 

Consider for example Karatsuba multiplication, and let K{m, n) be the num- 
ber of word-products for an m x n product. Take for example m = 5, n = 3. 
A natural idea is to pad the smallest operand to the size of the largest one.
However there are several ways to perform this padding, as shown in the
following figure, where the "Karatsuba cut" is represented by a double column:

The left variant leads to two products of size 3, i.e., $2K(3, 3)$, the middle one
to $K(2, 1) + K(3, 2) + K(3, 3)$, and the right one to $K(2, 2) + K(3, 1) + K(3, 3)$,
which give respectively 14, 15, 13 word products.

However, whenever $m/2 \le n \le m$, any such "padding variant" will require
$K(\lceil m/2 \rceil, \lceil m/2 \rceil)$ for the product of the differences (or sums) of the
low and high parts from the operands, due to a "wrap-around" effect when
subtracting the parts from the smaller operand; this will ultimately lead to
a cost similar to that of an $m \times m$ product. The "odd-even scheme" of
Algorithm OddEvenKaratsuba (see also Exercise 1.13) avoids this wrap-around.

Algorithm 1.5 OddEvenKaratsuba

Input: $A = \sum_0^{m-1} a_i x^i$, $B = \sum_0^{n-1} b_j x^j$, $m \ge n \ge 1$
Output: $A \cdot B$
if $n = 1$ then return $\sum_0^{m-1} a_i b_0 x^i$
$k \leftarrow \lceil m/2 \rceil$, $\ell \leftarrow \lceil n/2 \rceil$
write $A = A_0(x^2) + xA_1(x^2)$, $B = B_0(x^2) + xB_1(x^2)$
$C_0 \leftarrow$ OddEvenKaratsuba($A_0$, $B_0$)
$C_1 \leftarrow$ OddEvenKaratsuba($A_0 + A_1$, $B_0 + B_1$)
$C_2 \leftarrow$ OddEvenKaratsuba($A_1$, $B_1$)
return $C_0(x^2) + x(C_1 - C_0 - C_2)(x^2) + x^2C_2(x^2)$.

Here is an example of this algorithm for $m = 3$ and $n = 2$. Take
$A = a_2x^2 + a_1x + a_0$ and $B = b_1x + b_0$. This yields $A_0 = a_2x + a_0$, $A_1 = a_1$,
$B_0 = b_0$, $B_1 = b_1$, thus $C_0 = (a_2x + a_0)b_0$, $C_1 = (a_2x + a_0 + a_1)(b_0 + b_1)$,
$C_2 = a_1b_1$. We thus get $K(3, 2) = 2K(2, 1) + K(1) = 5$ with the odd-even
scheme. The general recurrence for the odd-even scheme is:

$K(m, n) = 2K(\lceil m/2 \rceil, \lceil n/2 \rceil) + K(\lfloor m/2 \rfloor, \lfloor n/2 \rfloor),$

instead of

$K(m, n) = 2K(\lceil m/2 \rceil, \lceil m/2 \rceil) + K(\lfloor m/2 \rfloor, n - \lceil m/2 \rceil)$

for the classical variant, assuming $n > m/2$. We see that the second parameter
in $K(\cdot, \cdot)$ only depends on the smaller size $n$ for the odd-even scheme.
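
The odd-even scheme is easy to try on polynomials given as coefficient lists. The sketch below assumes little-endian lists with `len(a) >= len(b) >= 1`; the helper names are hypothetical, and `odd_even_count` just evaluates the odd-even recurrence above with $K(m, 1) = m$.

```python
def poly_add(p, q):
    """Coefficient-wise sum of two lists, assuming len(p) >= len(q)."""
    return [x + (q[i] if i < len(q) else 0) for i, x in enumerate(p)]


def odd_even_karatsuba(a, b):
    """Product of polynomials a (m coefficients) and b (n coefficients)."""
    m, n = len(a), len(b)
    if n == 1:
        return [ai * b[0] for ai in a]
    a0, a1 = a[0::2], a[1::2]          # even-index and odd-index coefficients
    b0, b1 = b[0::2], b[1::2]
    c0 = odd_even_karatsuba(a0, b0)
    c1 = odd_even_karatsuba(poly_add(a0, a1), poly_add(b0, b1))
    c2 = odd_even_karatsuba(a1, b1)
    mid = [c1[i] - c0[i] - (c2[i] if i < len(c2) else 0)
           for i in range(len(c1))]
    res = [0] * (m + n + 1)            # two spare slots, trimmed below
    for i, x in enumerate(c0):
        res[2 * i] += x                # C0(x^2)
    for i, x in enumerate(mid):
        res[2 * i + 1] += x            # x * (C1 - C0 - C2)(x^2)
    for i, x in enumerate(c2):
        res[2 * i + 2] += x            # x^2 * C2(x^2)
    return res[:m + n - 1]


def odd_even_count(m, n):
    """Number of coefficient products K(m, n) for the odd-even scheme."""
    if n == 1:
        return m
    return 2 * odd_even_count((m + 1) // 2, (n + 1) // 2) + odd_even_count(m // 2, n // 2)
```

The spare slots in `res` absorb a top middle coefficient that is always zero when both $m$ and $n$ are odd.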

As for the classical variant, there are several ways of padding with the
odd-even scheme. Consider $m = 5$, $n = 3$, and write
$A := a_4x^4 + a_3x^3 + a_2x^2 + a_1x + a_0 = xA_1(x^2) + A_0(x^2)$,
with $A_1(x) = a_3x + a_1$, $A_0(x) = a_4x^2 + a_2x + a_0$; and
$B := b_2x^2 + b_1x + b_0 = xB_1(x^2) + B_0(x^2)$, with $B_1(x) = b_1$, $B_0(x) = b_2x + b_0$.
Without padding, we write
$AB = x^2(A_1B_1)(x^2) + x((A_0 + A_1)(B_0 + B_1) - A_1B_1 - A_0B_0)(x^2) + (A_0B_0)(x^2)$,
which gives $K(5, 3) = K(2, 1) + 2K(3, 2) = 12$. With padding, we consider
$xB = xB'_1(x^2) + B'_0(x^2)$, with $B'_1(x) = b_2x + b_0$, $B'_0 = b_1x$. This gives
$K(2, 2) = 3$ for $A_1B'_1$, $K(3, 2) = 5$ for $(A_0 + A_1)(B'_0 + B'_1)$, and
$K(3, 1) = 3$ for $A_0B'_0$ (taking into account the fact that $B'_0$ has only one
non-zero coefficient), thus a total of 11 only.

Note that when the variable $x$ corresponds to say $\beta = 2^{64}$, Algorithm
OddEvenKaratsuba as presented above is not very practical in the integer
case, because of a problem with carries. For example, in the sum $A_0 + A_1$ we
have $\lfloor m/2 \rfloor$ carries to store. A workaround is to consider $x$ to be say $\beta^{10}$, in
which case we have to store only one carry bit for 10 words, instead of one
carry bit per word.

The first strategy, which consists in cutting the operands into an equal
number of pieces of unequal sizes, does not scale up nicely. Assume for
example that we want to multiply a number of 999 words by another number
of 699 words, using Toom-Cook 3-way. With the classical variant (without
padding) and a "large" base of $\beta^{333}$, we cut the larger operand into three
pieces of 333 words and the smaller one into two pieces of 333 words and
one small piece of 33 words. This gives four full 333 × 333 products
(ignoring carries) and one unbalanced 333 × 33 product (for the evaluation
at $x = \infty$). The "odd-even" variant cuts the larger operand into three pieces
of 333 words, and the smaller operand into three pieces of 233 words, giving
rise to five equally unbalanced 333 × 233 products, again ignoring carries.

Second Strategy: Different Number of Pieces of Equal Sizes 

Instead of splitting unbalanced operands into an equal number of pieces — 
which are then necessarily of different sizes — an alternative strategy is to 
split the operands into a different number of pieces, and use a multiplica- 
tion algorithm which is naturally unbalanced. Consider again the example 
of multiplying two numbers of 999 and 699 words. Assume we have a
multiplication algorithm, say Toom-(3, 2), which multiplies a number of 3n words
by another number of 2n words; this requires four products of numbers of 
about n words. Using n = 350, we can split the larger number into two pieces 
of 350 words, and one piece of 299 words, and the smaller number into one 
piece of 350 words and one piece of 349 words. 

Similarly, for two inputs of 1000 and 500 words, we can use a Toom-(4, 2)
algorithm which multiplies two numbers of $4n$ and $2n$ words, with
$n = 250$. Such an algorithm requires five evaluation points; if we choose the
same points as for Toom 3-way, then the interpolation phase can be shared
between both implementations.

It seems that this second strategy is not compatible with the "odd-even"
variant, which requires that both operands are cut into the same number of
pieces. Consider for example the "odd-even" variant modulo 3. It writes the
numbers to be multiplied as $A = a(\beta)$ and $B = b(\beta)$ with
$a(t) = a_0(t^3) + t\,a_1(t^3) + t^2 a_2(t^3)$, and similarly
$b(t) = b_0(t^3) + t\,b_1(t^3) + t^2 b_2(t^3)$. We see that
the number of pieces of each operand is the chosen modulus, here 3 (see
Exercise 1.11).

Asymptotic complexity of unbalanced multiplication 

Suppose $m \ge n$ and $n$ is large. To use an evaluation-interpolation scheme
we need to evaluate the product at $m + n$ points, whereas balanced $k$ by $k$
multiplication needs $2k$ points. Taking $k \approx (m+n)/2$, we see that
$M(m, n) \le M((m + n)/2)(1 + o(1))$ as $n \to \infty$. On the other hand, from the discussion
above, we have $M(m, n) \le \lceil m/n \rceil M(n)$. This explains the upper bound on
$M(m, n)$ given in the Summary of Complexities at the end of the book.

1.3.6 Squaring 

In many applications, a significant proportion of the multiplications have
equal operands, i.e., are squarings. Hence it is worth tuning a special
squaring implementation as much as the implementation of multiplication itself,
bearing in mind that the best possible speedup is two (see Exercise 1.17).

For naive multiplication, Algorithm BasecaseMultiply (§1.3.1) can be
modified to obtain a theoretical speedup of two, since only about half of the
products $a_i b_j$ need to be computed.

Subquadratic algorithms like Karatsuba and Toom-Cook r-way can be
specialized for squaring too. In general, the threshold obtained is larger than
the corresponding multiplication threshold. For example, on a modern 64-bit
computer, one can expect a threshold between the naive quadratic squaring
and Karatsuba's algorithm in the 30-word range, between Karatsuba's and
Toom-Cook 3-way in the 100-word range, between Toom-Cook 3-way and
Toom-Cook 4-way in the 150-word range, and between Toom-Cook 4-way
and the FFT in the 2500-word range.

Figure 1.1: The best algorithm to multiply two numbers of $x$ and $y$ words
for $4 \le x \le y \le 200$: bc is schoolbook multiplication, 22 is Karatsuba's
algorithm, 33 is Toom-3, 32 is Toom-(3,2), 44 is Toom-4, and 42 is Toom-(4,2).
This graph was obtained on a Core 2, with GMP 5.0.0, and GCC 4.4.2.
Note that for $x \le (y + 3)/4$, only the schoolbook multiplication is available;
since we did not consider the algorithm that cuts the larger operand into
several pieces, this explains why bc is best for say $x = 46$ and $y = 200$.

Figure 1.2: Ratio of the squaring and multiplication time for the GNU MP
library, version 5.0.0, on a Core 2 processor, up to one million words.



Figure 1.2 compares the multiplication and squaring time with the GNU MP
library. It shows that whatever the word range, a good rule of thumb is to
count 2/3 of the cost of a product for a squaring.

The classical approach for fast squaring is to take a fast multiplication
algorithm, say Toom-Cook r-way, and to replace the $2r - 1$ recursive
products by $2r - 1$ recursive squarings. For example, starting from
Algorithm ToomCook3, we obtain five recursive squarings $a_0^2$, $(a_0 + a_1 + a_2)^2$,
$(a_0 - a_1 + a_2)^2$, $(a_0 + 2a_1 + 4a_2)^2$, and $a_2^2$. A different approach, called
asymmetric squaring, is to allow products which are not squares in the recursive
calls. For example, the square of $a_2\beta^2 + a_1\beta + a_0$ is
$c_4\beta^4 + c_3\beta^3 + c_2\beta^2 + c_1\beta + c_0$,
where $c_4 = a_2^2$, $c_3 = 2a_1a_2$, $c_2 = c_0 + c_4 - s$, $c_1 = 2a_1a_0$, and $c_0 = a_0^2$, where
$s = (a_0 - a_2 + a_1)(a_0 - a_2 - a_1)$. This formula performs two squarings, and
three normal products. Such asymmetric squaring formulae are not
asymptotically optimal, but might be faster in some medium range, due to simpler
evaluation or interpolation phases.
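
The asymmetric squaring formula can be checked with a short sketch. The function name `asymmetric_square3` is an assumption made for illustration, and Python integers stand in for the three pieces; the point is only to show which operations are squarings and which are ordinary products.

```python
def asymmetric_square3(a2, a1, a0, beta):
    """Square a2*beta^2 + a1*beta + a0 using two squarings and three products."""
    c4 = a2 * a2                          # squaring
    c0 = a0 * a0                          # squaring
    c3 = 2 * (a1 * a2)                    # product
    c1 = 2 * (a1 * a0)                    # product
    s = (a0 - a2 + a1) * (a0 - a2 - a1)   # product; s = (a0 - a2)^2 - a1^2
    c2 = c0 + c4 - s                      # equals 2*a0*a2 + a1^2
    return c4 * beta**4 + c3 * beta**3 + c2 * beta**2 + c1 * beta + c0
```

The coefficients are not reduced below the base here, so carries are absorbed by the final integer sum rather than propagated word by word.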

1.3.7 Multiplication by a Constant 

It often happens that the same multiplier is used in several consecutive oper- 
ations, or even for a complete calculation. If this constant multiplier is small, 
i.e., less than the base f3, not much speedup can be obtained compared to 
the usual product. We thus consider here a "large" constant multiplier. 

When using evaluation-interpolation algorithms, like Karatsuba or
Toom-Cook (see §1.3.2–1.3.3), one may store the evaluations for that fixed
multiplier at the different points chosen.

Special-purpose algorithms also exist. These algorithms differ from
classical multiplication algorithms because they take into account the value of
the given constant multiplier, and not only its size in bits or digits. They
also differ in the model of complexity used. For example, R. Bernstein's
algorithm [27], which is used by several compilers to compute addresses in
data structure records, considers as basic operation $x, y \mapsto 2^i x \pm y$, with a
cost assumed to be independent of the integer $i$.
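
As a rough illustration of this cost model, the sketch below multiplies by the constant 20061 with five operations, each of the form $2^i x \pm y$. The function name `times_20061` is an assumption made for illustration; the shift amounts are the ones forced by the arithmetic of the intermediate multiples 31, 93, 743 and 6687.

```python
def times_20061(x):
    """Multiply x by 20061 with five shift-and-add/subtract operations."""
    x1 = (x << 5) - x       # 31x    = 2^5 x - x
    x2 = (x1 << 1) + x1     # 93x    = 2^1 x1 + x1
    x3 = (x2 << 3) - x      # 743x   = 2^3 x2 - x
    x4 = (x3 << 3) + x3     # 6687x  = 2^3 x3 + x3
    return (x4 << 1) + x4   # 20061x = 2^1 x4 + x4
```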

For example, Bernstein's algorithm computes $20061x$ in five steps:

    $x_1 := 31x = 2^5 x - x$
    $x_2 := 93x = 2^1 x_1 + x_1$
    $x_3 := 743x = 2^3 x_2 - x$
    $x_4 := 6687x = 2^3 x_3 + x_3$
    $20061x = 2^1 x_4 + x_4$.

1.4 Division

Division is the next operation to consider after multiplication. Optimizing
division is almost as important as optimizing multiplication, since division is
usually more expensive, thus the speedup obtained on division will be more
significant. On the other hand, one usually performs more multiplications
than divisions.

One strategy is to avoid divisions when possible, or replace them by
multiplications. An example is when the same divisor is used for several
consecutive operations; one can then precompute its inverse (see §2.4.1).

We distinguish several kinds of division: full division computes both quo- 
tient and remainder, while in other cases only the quotient (for example, 
when dividing two floating-point significands) or remainder (when multiply- 
ing two residues modulo n) is needed. We also discuss exact division — when 
the remainder is known to be zero — and the problem of dividing by a single 
word. 



1.4.1 Naive Division 

In all division algorithms, we assume that divisors are normalized. We say
that $B := \sum_{j=0}^{n-1} b_j \beta^j$ is normalized when its most significant word $b_{n-1}$
satisfies $b_{n-1} \ge \beta/2$. This is a stricter condition (for $\beta > 2$) than simply requiring
that $b_{n-1}$ be nonzero.



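Algorithm 1.6 BasecaseDivRem

Input: $A = \sum_{i=0}^{n+m-1} a_i \beta^i$, $B = \sum_{j=0}^{n-1} b_j \beta^j$, $B$ normalized, $m \ge 0$
Output: quotient $Q$ and remainder $R$ of $A$ divided by $B$
1: if $A \ge \beta^m B$ then $q_m \leftarrow 1$, $A \leftarrow A - \beta^m B$ else $q_m \leftarrow 0$
2: for $j$ from $m - 1$ downto 0 do
3:     $q_j^* \leftarrow \lfloor (a_{n+j}\beta + a_{n+j-1}) / b_{n-1} \rfloor$    ▷ quotient selection step
4:     $q_j \leftarrow \min(q_j^*, \beta - 1)$
5:     $A \leftarrow A - q_j \beta^j B$
6:     while $A < 0$ do
7:         $q_j \leftarrow q_j - 1$
8:         $A \leftarrow A + \beta^j B$
9: return $Q = \sum_{j=0}^{m} q_j \beta^j$, $R = A$.
(Note: in step 3, $a_i$ denotes the current value of the $i$-th word of $A$, which
may be modified at steps 5 and 8.)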
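
A minimal Python sketch of Algorithm BasecaseDivRem follows. It is an illustration under simplifying assumptions: $A$ and $B$ are held as Python integers rather than word arrays, the word counts `n` and `m` are passed explicitly, and the function name `basecase_divrem` is chosen here for illustration.

```python
def basecase_divrem(A, B, beta, n, m):
    """Divide A (n+m words) by B (n words, normalized) in base beta."""
    b_top = B // beta ** (n - 1)             # most significant word b_{n-1}
    assert 2 * b_top >= beta, "B must be normalized"
    q = [0] * (m + 1)
    if A >= beta ** m * B:                   # step 1
        q[m] = 1
        A -= beta ** m * B
    for j in range(m - 1, -1, -1):           # step 2
        # Two leading words a_{n+j}*beta + a_{n+j-1} of the current A.
        top2 = A // beta ** (n + j - 1)
        q_star = top2 // b_top               # step 3: quotient selection
        q[j] = min(q_star, beta - 1)         # step 4
        A -= q[j] * beta ** j * B            # step 5
        while A < 0:                         # steps 6-8: correction
            q[j] -= 1
            A += beta ** j * B
    Q = sum(qj * beta ** j for j, qj in enumerate(q))
    return Q, A
```

With $\beta = 1000$, $n = m = 3$, the call `basecase_divrem(766970544842443844, 862664913, 1000, 3, 3)` selects the partial quotients 889, 071 and 218, the last one being corrected to 217.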



If $B$ is not normalized, we can compute $A' = 2^k A$ and $B' = 2^k B$ so that
$B'$ is normalized, then divide $A'$ by $B'$ giving $A' = Q'B' + R'$; the quotient and
remainder of the division of $A$ by $B$ are respectively $Q := Q'$ and $R := R'/2^k$,
the latter division being exact.

Theorem 1.4.1 Algorithm BasecaseDivRem correctly computes the
quotient and remainder of the division of $A$ by a normalized $B$, in $O(n(m + 1))$
word operations.

Proof. We prove that the invariant $A < \beta^{j+1} B$ holds at step 2. This holds
trivially for $j = m - 1$: $B$ being normalized, $A < 2\beta^m B$ initially.

First consider the case $q_j = q_j^*$: then $q_j b_{n-1} \ge a_{n+j}\beta + a_{n+j-1} - b_{n-1} + 1$,
thus

    $A - q_j \beta^j B \le (b_{n-1} - 1)\beta^{n+j-1} + (A \bmod \beta^{n+j-1})$,

which ensures that the new $a_{n+j}$ vanishes, and $a_{n+j-1} < b_{n-1}$, thus $A < \beta^j B$
after step 5. Now $A$ may become negative after step 5, but since
$q_j b_{n-1} \le a_{n+j}\beta + a_{n+j-1}$, we have:

    $A - q_j \beta^j B > (a_{n+j}\beta + a_{n+j-1})\beta^{n+j-1} - q_j(b_{n-1}\beta^{n-1} + \beta^{n-1})\beta^j \ge -q_j \beta^{n+j-1}$.

Therefore $A - q_j \beta^j B + 2\beta^j B \ge (2b_{n-1} - q_j)\beta^{n+j-1} > 0$, which proves that the
while-loop at steps 6–8 is performed at most twice [143, Theorem 4.3.1.B].
When the while-loop is entered, $A$ may increase only by $\beta^j B$ at a time, hence
$A < \beta^j B$ at exit.

In the case $q_j \ne q_j^*$, i.e., $q_j^* \ge \beta$, we have before the while-loop:
$A < \beta^{j+1} B - (\beta - 1)\beta^j B = \beta^j B$, thus the invariant holds. If the while-loop is
entered, the same reasoning as above holds.

We conclude that when the for-loop ends, $0 \le A < B$ holds, and since
$(\sum_{i=j}^{m} q_i \beta^i) B + A$ is invariant throughout the algorithm, the quotient $Q$ and
remainder $R$ are correct.

The most expensive part is step 5, which costs $O(n)$ operations for $q_j B$
(the multiplication by $\beta^j$ is simply a word-shift); the total cost is $O(n(m + 1))$.
(For $m = 0$ we need $O(n)$ work if $A \ge B$, and even if $A < B$ to compare the
inputs in the case $A = B - 1$.) □

Here is an example of algorithm BasecaseDivRem for the inputs
$A = 766\,970\,544\,842\,443\,844$ and $B = 862\,664\,913$, with $\beta = 1000$, which
gives quotient $Q = 889\,071\,217$ and remainder $R = 778\,334\,723$.

    $j$   $A$                          $q_j$   $A - q_j B \beta^j$      after correction
    2   766 970 544 842 443 844    889   61 437 185 443 844   no change
    1   61 437 185 443 844         071   187 976 620 844      no change
    0   187 976 620 844            218   −84 330 190          778 334 723

Algorithm BasecaseDivRem simplifies when $A < \beta^m B$: remove step 1
and change $m$ into $m - 1$ in the return value $Q$. However, the more general
form we give is more convenient for a computer implementation, and will be
used below.

A possible variant when $q_j^* \ge \beta$ is to let $q_j = \beta$; then $A - q_j \beta^j B$ at step
5 reduces to a single subtraction of $B$ shifted by $j + 1$ words. However in
this case the while-loop will be performed at least once, which corresponds
to the identity $A - (\beta - 1)\beta^j B = A - \beta^{j+1} B + \beta^j B$.

If instead of having $B$ normalized, i.e., $b_{n-1} \ge \beta/2$, one has $b_{n-1} \ge \beta/k$, there
can be up to $k$ iterations of the while-loop (and step 1 has to be modified).

A drawback of Algorithm BasecaseDivRem is that the test $A < 0$
at line 6 is true with non-negligible probability, therefore branch prediction
algorithms available on modern processors will fail, resulting in wasted cycles.
A workaround is to compute a more accurate partial quotient, in order to
decrease the proportion of corrections to almost zero (see Exercise 1.20).

1.4.2 Divisor Preconditioning 

Sometimes the quotient selection (step 3 of Algorithm BasecaseDivRem)
is quite expensive compared to the total cost, especially for small sizes.
Indeed, some processors do not have a machine instruction for the division
of two words by one word; one way to compute $q_j^*$ is then to precompute a
one-word approximation of the inverse of $b_{n-1}$, and to multiply it by
$a_{n+j}\beta + a_{n+j-1}$.

Svoboda's algorithm makes the quotient selection trivial, after
preconditioning the divisor. The main idea is that if $b_{n-1}$ equals the base $\beta$ in
Algorithm BasecaseDivRem, then the quotient selection is easy, since it
suffices to take $q_j^* = a_{n+j}$. (In addition, $q_j^* \le \beta - 1$ is then always fulfilled,
thus step 4 of BasecaseDivRem can be avoided, and $q_j^*$ replaced by $q_j$.)

Algorithm 1.7 SvobodaDivision

Input: $A = \sum_{i=0}^{n+m-1} a_i \beta^i$, $B = \sum_{j=0}^{n-1} b_j \beta^j$ normalized, $A < \beta^m B$, $m \ge 1$
Output: quotient $Q$ and remainder $R$ of $A$ divided by $B$
1: $k \leftarrow \lceil \beta^{n+1}/B \rceil$
2: $B' \leftarrow kB = \beta^{n+1} + \sum_{j=0}^{n-1} b'_j \beta^j$
3: for $j$ from $m - 1$ downto 1 do
4:     $q_j \leftarrow a_{n+j}$    ▷ current value of $a_{n+j}$
5:     $A \leftarrow A - q_j \beta^{j-1} B'$
6:     if $A < 0$ then
7:         $q_j \leftarrow q_j - 1$
8:         $A \leftarrow A + \beta^{j-1} B'$
9: $Q' \leftarrow \sum_{j=1}^{m-1} q_j \beta^{j-1}$, $R' \leftarrow A$
10: $(q_0, R) \leftarrow (R' \text{ div } B, R' \bmod B)$    ▷ using BasecaseDivRem
11: return $Q = kQ' + q_0$, $R$.

With the example of §1.4.1, Svoboda's algorithm would give $k = 1160$,
$B' = 1\,000\,691\,299\,080$:

    $j$   $A$                          $q_j$   $A - q_j B' \beta^{j-1}$   after correction
    2   766 970 544 842 443 844    766   441 009 747 163 844   no change
    1   441 009 747 163 844        441   −295 115 730 436      705 575 568 644

We thus get $Q' = 766\,440$ and $R' = 705\,575\,568\,644$. The final division of
step 10 gives $R' = 817B + 778\,334\,723$, thus we get $Q = 1160 \cdot 766\,440 + 817 =
889\,071\,217$, and $R = 778\,334\,723$, as in §1.4.1.

Svoboda's algorithm is especially interesting when only the remainder is
needed, since then one can avoid the "deconditioning" $Q = kQ' + q_0$. Note
that when only the quotient is needed, dividing $A' = kA$ by $B' = kB$ is
another way to compute it.
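
The preconditioning idea can be sketched in Python as follows. This is an illustration under assumptions: integers are Python ints rather than word arrays, the name `svoboda_divrem` is hypothetical, and the final division by $B$ uses the built-in `divmod` as a stand-in for BasecaseDivRem.

```python
def svoboda_divrem(A, B, beta, n, m):
    """Divide A by B (n words, normalized), assuming A < beta^m * B and m >= 1."""
    k = -(-beta ** (n + 1) // B)         # k = ceil(beta^(n+1) / B)
    Bp = k * B                           # B' = beta^(n+1) + lower-order words
    Qp = 0
    for j in range(m - 1, 0, -1):
        qj = A // beta ** (n + j)        # trivial selection: leading part of A
        A -= qj * beta ** (j - 1) * Bp
        if A < 0:                        # single correction step
            qj -= 1
            A += beta ** (j - 1) * Bp
        Qp += qj * beta ** (j - 1)
    q0, R = divmod(A, B)                 # stand-in for BasecaseDivRem on R'
    return k * Qp + q0, R                # "deconditioning" Q = k*Q' + q0
```

When only the remainder is wanted, the accumulation of `Qp` and the final multiplication by `k` can be dropped.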

1.4.3 Divide and Conquer Division 

The base-case division of §1.4.1 determines the quotient word by word. A
natural idea is to try getting several words at a time, for example replacing
the quotient selection step in Algorithm BasecaseDivRem by:

    $q_j^* \leftarrow \left\lfloor \dfrac{a_{n+j}\beta^3 + a_{n+j-1}\beta^2 + a_{n+j-2}\beta + a_{n+j-3}}{b_{n-1}\beta + b_{n-2}} \right\rfloor$.

Since $q_j^*$ has then two words, fast multiplication algorithms (§1.3) might
speed up the computation of $q_j B$ at step 5 of Algorithm BasecaseDivRem.

More generally, the most significant half of the quotient (say $Q_1$, of
$\ell = m - k$ words) mainly depends on the $\ell$ most significant words of the dividend
and divisor. Once a good approximation to $Q_1$ is known, fast multiplication
algorithms can be used to compute the partial remainder $A - Q_1 B \beta^k$. The
second idea of the divide and conquer algorithm RecursiveDivRem is to
compute the corresponding remainder together with the partial quotient $Q_1$,
in such a way that one only has to subtract the product of $Q_1$ by the low part of
the divisor, before computing the low part of the quotient.

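Algorithm 1.8 RecursiveDivRem

Input: $A = \sum_{i=0}^{n+m-1} a_i \beta^i$, $B = \sum_{j=0}^{n-1} b_j \beta^j$, $B$ normalized, $n \ge m$
Output: quotient $Q$ and remainder $R$ of $A$ divided by $B$
1: if $m < 2$ then return BasecaseDivRem($A$, $B$)
2: $k \leftarrow \lfloor m/2 \rfloor$, $B_1 \leftarrow B \text{ div } \beta^k$, $B_0 \leftarrow B \bmod \beta^k$
3: $(Q_1, R_1) \leftarrow$ RecursiveDivRem($A \text{ div } \beta^{2k}$, $B_1$)
4: $A' \leftarrow R_1 \beta^{2k} + (A \bmod \beta^{2k}) - Q_1 B_0 \beta^k$
5: while $A' < 0$ do $Q_1 \leftarrow Q_1 - 1$, $A' \leftarrow A' + \beta^k B$
6: $(Q_0, R_0) \leftarrow$ RecursiveDivRem($A' \text{ div } \beta^k$, $B_1$)
7: $A'' \leftarrow R_0 \beta^k + (A' \bmod \beta^k) - Q_0 B_0$
8: while $A'' < 0$ do $Q_0 \leftarrow Q_0 - 1$, $A'' \leftarrow A'' + B$
9: return $Q := Q_1 \beta^k + Q_0$, $R := A''$.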



In Algorithm RecursiveDivRem, one may replace the condition $m < 2$
at step 1 by $m < T$ for any integer $T \ge 2$. In practice, $T$ is usually in the
range 50 to 200.
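
The recursion can be sketched in Python as below. This is an illustration, not a tuned implementation: operands are Python integers, the parameter `T` plays the role of the threshold just mentioned, the name `recursive_divrem` is chosen here, and the base case uses the built-in `divmod` as a stand-in for BasecaseDivRem.

```python
def recursive_divrem(A, B, beta, m, T=2):
    """Divide A by B, where m is the number of quotient words; B is normalized."""
    if m < T:
        return divmod(A, B)                      # stand-in for BasecaseDivRem
    k = m // 2
    bk = beta ** k
    B1, B0 = divmod(B, bk)                       # step 2: split the divisor
    Q1, R1 = recursive_divrem(A // bk ** 2, B1, beta, m - k, T)   # step 3
    A1 = R1 * bk ** 2 + A % bk ** 2 - Q1 * B0 * bk                # step 4
    while A1 < 0:                                # step 5: correct Q1
        Q1 -= 1
        A1 += bk * B
    Q0, R0 = recursive_divrem(A1 // bk, B1, beta, k, T)           # step 6
    A2 = R0 * bk + A1 % bk - Q0 * B0                              # step 7
    while A2 < 0:                                # step 8: correct Q0
        Q0 -= 1
        A2 += B
    return Q1 * bk + Q0, A2                      # step 9
```

Each recursive call divides by the high part $B_1$ only; the products $Q_1 B_0$ and $Q_0 B_0$ are where a fast multiplication routine would be plugged in.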

One cannot require $A < \beta^m B$ at input, since this condition may not be
satisfied in the recursive calls. Consider for example $A = 5517$, $B = 56$ with
$\beta = 10$: the first recursive call will divide 55 by 5, which yields a two-digit
quotient 11. Even $A \le \beta^m B$ is not recursively fulfilled, as this example
shows. The weakest possible input condition is that the $n$ most significant
words of $A$ do not exceed those of $B$, i.e., $A < \beta^m (B + 1)$. In that case, the
quotient is bounded by $\beta^m + \lfloor (\beta^m - 1)/B \rfloor$, which yields $\beta^m + 1$ in the case
$n = m$ (compare Exercise 1.19). See also Exercise 1.22.

Theorem 1.4.2 Algorithm RecursiveDivRem is correct, and uses
$D(n + m, n)$ operations, where $D(n + m, n) = 2D(n, n - m/2) + 2M(m/2) + O(n)$.
In particular $D(n) := D(2n, n)$ satisfies $D(n) = 2D(n/2) + 2M(n/2) + O(n)$,
which gives $D(n) \sim M(n)/(2^{\alpha-1} - 1)$ for $M(n) \sim n^{\alpha}$, $\alpha > 1$.

Proof. We first check the assumption for the recursive calls: $B_1$ is
normalized since it has the same most significant word as $B$.

After step 3, we have $A = (Q_1 B_1 + R_1)\beta^{2k} + (A \bmod \beta^{2k})$, thus after step
4, $A' = A - Q_1 \beta^k B$, which still holds after step 5. After step 6, we have
$A' = (Q_0 B_1 + R_0)\beta^k + (A' \bmod \beta^k)$, thus after step 7, $A'' = A' - Q_0 B$, which
still holds after step 8. At step 9 we thus have $A = QB + R$.

$A \text{ div } \beta^{2k}$ has $m + n - 2k$ words, while $B_1$ has $n - k$ words, thus
$0 \le Q_1 < 2\beta^{m-k}$ and $0 \le R_1 < B_1 < \beta^{n-k}$. Thus at step 4,
$-2\beta^{m+k} < A' < \beta^k B$.
Since $B$ is normalized, the while-loop at step 5 is performed at most four
times (this can happen only when $n = m$). At step 6 we have $0 \le A' < \beta^k B$,
thus $A' \text{ div } \beta^k$ has at most $n$ words.

It follows that $0 \le Q_0 < 2\beta^k$ and $0 \le R_0 < B_1 < \beta^{n-k}$. Hence at step
7, $-2\beta^{2k} < A'' < B$, and after at most four iterations at step 8, we have
$0 \le A'' < B$. □

Theorem 1.4.2 gives $D(n) \sim 2M(n)$ for Karatsuba multiplication, and
$D(n) \sim 2.63M(n)$ for Toom-Cook 3-way; in the FFT range, see Exercise 1.23.

The same idea as in Exercise 1.20 applies: to decrease the probability that
the estimated quotients $Q_1$ and $Q_0$ are too large, use one extra word of the
truncated dividend and divisors in the recursive calls to RecursiveDivRem.

A graphical view of Algorithm RecursiveDivRem in the case $m = n$
is given in Figure 1.3, which represents the multiplication $Q \cdot B$: one first
computes the lower left corner in $D(n/2)$ (step 3), second the lower right
corner in $M(n/2)$ (step 4), third the upper left corner in $D(n/2)$ (step 6),
and finally the upper right corner in $M(n/2)$ (step 7).

Unbalanced Division 

The condition $n \ge m$ in Algorithm RecursiveDivRem means that the
dividend $A$ is at most twice as large as the divisor $B$.

When $A$ is more than twice as large as $B$ ($m > n$ with the notation above),
a possible strategy (see Exercise 1.24) computes $n$ words of the quotient at
a time. This reduces to the base-case algorithm, replacing $\beta$ by $\beta^n$.

Figure 1.4 compares unbalanced multiplication and division in GNU MP.
As expected, multiplying $x$ words by $n - x$ words takes the same time as
multiplying $n - x$ words by $x$ words. However, there is no symmetry for the
division, since dividing $n$ words by $x$ words for $x < n/2$ is more expensive,
at least for the version of GMP that we used, than dividing $n$ words by $n - x$
words.

Figure 1.3: Divide and conquer division: a graphical view (most significant
parts at the lower left corner).



Algorithm 1.9 UnbalancedDivision

Input: $A = \sum_{i=0}^{n+m-1} a_i \beta^i$, $B = \sum_{j=0}^{n-1} b_j \beta^j$, $B$ normalized, $m > n$
Output: quotient $Q$ and remainder $R$ of $A$ divided by $B$
$Q \leftarrow 0$
while $m > n$ do
    $(q, r) \leftarrow$ RecursiveDivRem($A \text{ div } \beta^{m-n}$, $B$)    ▷ $2n$ by $n$ division
    $Q \leftarrow Q\beta^n + q$
    $A \leftarrow r\beta^{m-n} + A \bmod \beta^{m-n}$
    $m \leftarrow m - n$
$(q, r) \leftarrow$ RecursiveDivRem($A$, $B$)
return $Q := Q\beta^m + q$, $R := r$.

Figure 1.4: Time in 10~^ seconds for the multiplication (lower curve) of $x$
words by $1000 - x$ words and for the division (upper curve) of 1000 words
by $x$ words, with GMP 5.0.0 on a Core 2 running at 2.83GHz.

1.4.4 Newton's Method 

Newton's iteration gives the division algorithm with best asymptotic
complexity. One basic component of Newton's iteration is the computation of
an approximate inverse. We refer here to Chapter 4. The $p$-adic version
of Newton's method, also called Hensel lifting, is used in §1.4.5 for exact
division.

1.4.5 Exact Division 

A division is exact when the remainder is zero. This happens, for example, 
when normalizing a fraction a/b: one divides both a and b by their greatest 
common divisor, and both divisions are exact. If the remainder is known 
a priori to be zero, this information is useful to speed up the computation 
of the quotient. Two strategies are possible: 

• use MSB (most significant bits first) division algorithms, without
computing the lower part of the remainder. Here, one has to take care
of rounding errors, in order to guarantee the correctness of the final
result; or

• use LSB (least significant bits first) algorithms. If the quotient is known
to be less than $\beta^n$, computing $a/b \bmod \beta^n$ will reveal it.

Subquadratic algorithms can use both strategies. We describe a least sig- 
nificant bit algorithm using Hensel lifting, which can be viewed as a p-adic 
version of Newton's method: 

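Algorithm 1.10 ExactDivision

Input: $A = \sum_{i=0}^{n-1} a_i \beta^i$, $B = \sum_{j=0}^{n-1} b_j \beta^j$
Output: quotient $Q = A/B \bmod \beta^n$
Require: $\gcd(b_0, \beta) = 1$
1: $C \leftarrow 1/b_0 \bmod \beta$
2: for $i$ from $\lceil \lg n \rceil - 1$ downto 1 do
3:     $k \leftarrow \lceil n/2^i \rceil$
4:     $C \leftarrow C + C(1 - BC) \bmod \beta^k$
5: $Q \leftarrow AC \bmod \beta^k$
6: $Q \leftarrow Q + C(A - BQ) \bmod \beta^n$.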



Algorithm ExactDivision uses the Karp-Markstein trick: lines 1–4
compute $1/B \bmod \beta^{\lceil n/2 \rceil}$, while the two last lines incorporate the dividend to
obtain $A/B \bmod \beta^n$. Note that the middle product (§3.3.2) can be used in
lines 4 and 6 to speed up the computation of $1 - BC$ and $A - BQ$ respectively.
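
The following Python sketch mirrors Algorithm ExactDivision on Python integers. It is illustrative only: the name `exact_division` is an assumption, `k` is initialized to 1 so that the cases $n \le 2$ (where the loop body never runs) are covered, and the modular inverse uses the three-argument `pow` with exponent −1, which requires Python 3.8 or later.

```python
def exact_division(A, B, beta, n):
    """Return A/B mod beta^n by Hensel lifting; requires gcd(B mod beta, beta) = 1."""
    C = pow(B % beta, -1, beta)          # step 1: C = 1/b_0 mod beta
    k = 1                                # precision of C, in words
    ceil_lg_n = (n - 1).bit_length()     # ceil(lg n) for n >= 1
    for i in range(ceil_lg_n - 1, 0, -1):
        k = -(-n // 2 ** i)              # step 3: k = ceil(n / 2^i)
        C = (C + C * (1 - B * C)) % beta ** k    # step 4: Newton step on 1/B
    Q = (A * C) % beta ** k              # step 5: low half of the quotient
    Q = (Q + C * (A - B * Q)) % beta ** n        # step 6: Karp-Markstein step
    return Q
```

If the division is exact and the true quotient is less than $\beta^n$, the returned value is that quotient.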

A further gain can be obtained by using both strategies simultaneously: 
compute the most significant n/2 bits of the quotient using the MSB strategy, 
and the least significant n/2 bits using the LSB strategy. Since a division of 
size n is replaced by two divisions of size n/2, this gives a speedup of up to 
two for quadratic algorithms (see Exercise 1.27).

1.4.6 Only Quotient or Remainder Wanted 

When both the quotient and remainder of a division are needed, it is best 
to compute them simultaneously. This may seem to be a trivial statement, 
nevertheless some high-level languages provide both div and mod, but no 
single instruction to compute both quotient and remainder.

Once the quotient is known, the remainder can be recovered by a single
multiplication as $A - QB$; on the other hand, when the remainder is known,
the quotient can be recovered by an exact division as $(A - R)/B$ (§1.4.5).

However, it often happens that only one of the quotient or remainder is
needed. For example, the division of two floating-point numbers reduces to
the quotient of their significands (see Chapter 3). Conversely, the
multiplication of two numbers modulo $N$ reduces to the remainder of their product
after division by $N$ (see Chapter 2). In such cases, one may wonder if faster
algorithms exist.

For a dividend of 2n words and a divisor of n words, a significant speedup 
— up to a factor of two for quadratic algorithms — can be obtained when 
only the quotient is needed, since one does not need to update the low n 
words of the current remainder (step [5] of Algorithm BasecaseDivRem). 

It seems difficult to get a similar speedup when only the remainder is
required. One possibility is to use Svoboda's algorithm, but this requires
some precomputation, so is only useful when several divisions are performed
with the same divisor. The idea is the following: precompute a multiple $B_1$
of $B$, having $3n/2$ words, the $n/2$ most significant words being $\beta^{n/2}$. Then
reducing $A \bmod B_1$ requires a single $n/2 \times n$ multiplication. Once $A$ is
reduced to $A_1$ of $3n/2$ words by Svoboda's algorithm with cost $2M(n/2)$, use
RecursiveDivRem on $A_1$ and $B$, which costs $D(n/2) + M(n/2)$. The total
cost is thus $3M(n/2) + D(n/2)$, instead of $2M(n/2) + 2D(n/2)$ for a full
division with RecursiveDivRem. This gives $5M(n)/3$ for Karatsuba and
$2.04M(n)$ for Toom-Cook 3-way, instead of $2M(n)$ and $2.63M(n)$
respectively. A similar algorithm is described in §2.4.2 (Subquadratic Montgomery
Reduction) with further optimizations.

1.4.7 Division by a Single Word 

We assume here that we want to divide a multiple precision number by a
one-word integer $c$. As for multiplication by a one-word integer, this is an
important special case. It arises for example in Toom-Cook multiplication,
where one has to perform an exact division by 3 (§1.3.3). One could of course
use a classical division algorithm (§1.4.1). When $\gcd(c, \beta) = 1$, Algorithm
DivideByWord might be used to compute a modular division:

    $A + b\beta^n = cQ$,

where the "carry" $b$ will be zero when the division is exact.

Algorithm 1.11 DivideByWord

Input: $A = \sum_{i=0}^{n-1} a_i \beta^i$, $0 \le c < \beta$, $\gcd(c, \beta) = 1$
Output: $Q = \sum_{i=0}^{n-1} q_i \beta^i$ and $0 \le b < c$ such that $A + b\beta^n = cQ$
1: $d \leftarrow 1/c \bmod \beta$    ▷ might be precomputed
2: $b \leftarrow 0$
3: for $i$ from 0 to $n - 1$ do
4:     if $b \le a_i$ then $(x, b') \leftarrow (a_i - b, 0)$
5:     else $(x, b') \leftarrow (a_i - b + \beta, 1)$
6:     $q_i \leftarrow dx \bmod \beta$
7:     $b'' \leftarrow (q_i c - x)/\beta$
8:     $b \leftarrow b' + b''$
9: return $\sum_{i=0}^{n-1} q_i \beta^i$, $b$.



Theorem 1.4.3 The output of Algorithm DivideByWord satisfies $A + b\beta^n = cQ$.

Proof. We show that after step $i$, $0 \le i < n$, we have $A_i + b\beta^{i+1} = cQ_i$,
where $A_i := \sum_{j=0}^{i} a_j \beta^j$ and $Q_i := \sum_{j=0}^{i} q_j \beta^j$. For $i = 0$, this is $a_0 + b\beta = cq_0$,
which is just line 7: since $q_0 = a_0/c \bmod \beta$, $q_0 c - a_0$ is divisible by $\beta$. Assume
now that $A_{i-1} + b\beta^i = cQ_{i-1}$ holds for $1 \le i < n$. We have $a_i - b + b'\beta = x$,
so $x + b''\beta = cq_i$, thus $A_i + (b' + b'')\beta^{i+1} = A_{i-1} + \beta^i(a_i + b'\beta + b''\beta) =
cQ_{i-1} - b\beta^i + \beta^i(x + b - b'\beta + b'\beta + b''\beta) = cQ_{i-1} + \beta^i(x + b''\beta) = cQ_i$. □

Remark: at step 7, since $0 \le x < \beta$, $b''$ can also be obtained as $\lfloor q_i c/\beta \rfloor$.

Algorithm DivideByWord is just a special case of Hensel's division, 
which is the topic of the next section; it can easily be extended to divide by 
integers of a few words. 
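
A direct Python transcription of Algorithm DivideByWord may help to follow the roles of the two carries $b'$ and $b''$. The representation of $A$ as a list of words (least significant first) and the name `divide_by_word` are assumptions made for this sketch; the modular inverse via `pow(c, -1, beta)` requires Python 3.8 or later.

```python
def divide_by_word(a_words, c, beta):
    """Return (q_words, b) with A + b*beta^n = c*Q; requires gcd(c, beta) = 1."""
    d = pow(c, -1, beta)                 # step 1: d = 1/c mod beta
    b = 0                                # step 2
    q_words = []
    for ai in a_words:                   # step 3: least significant word first
        if b <= ai:                      # step 4: no borrow
            x, b1 = ai - b, 0
        else:                            # step 5: borrow from the next word
            x, b1 = ai - b + beta, 1
        qi = (d * x) % beta              # step 6: next quotient word
        b2 = (qi * c - x) // beta        # step 7: exact division by beta
        q_words.append(qi)
        b = b1 + b2                      # step 8
    return q_words, b
```

For instance, with $\beta = 10$ and $c = 3$, the digits of 123456 yield the digits of 041152 and a final carry $b = 0$, since the division is exact.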

1.4.8 Hensel's Division 

Classical division involves cancelling the most significant part of the
dividend by a multiple of the divisor, while Hensel's division cancels the least
significant part (Figure 1.5). Given a dividend $A$ of $2n$ words and a divisor
$B$ of $n$ words, the classical or MSB (most significant bit) division computes
a quotient $Q$ and a remainder $R$ such that $A = QB + R$, while Hensel's or
LSB (least significant bit) division computes a LSB-quotient $Q'$ and a
LSB-remainder $R'$ such that $A = Q'B + R'\beta^n$. While MSB division requires the
most significant bit of $B$ to be set, LSB division requires $B$ to be relatively
prime to the word base $\beta$, i.e., $B$ to be odd for $\beta$ a power of two.

Figure 1.5: Classical/MSB division (left) vs Hensel/LSB division (right).

The LSB-quotient is uniquely defined by $Q' = A/B \bmod \beta^n$, with
$0 \le Q' < \beta^n$. This in turn uniquely defines the LSB-remainder
$R' = (A - Q'B)\beta^{-n}$, with $-B < R' < \beta^n$.
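
The two defining relations can be turned into a few lines of Python. This sketch computes $Q'$ with a full modular inverse rather than word by word, so it shows the definition and not an efficient algorithm; the name `hensel_divrem` is an assumption, and `pow(B, -1, ...)` requires Python 3.8 or later.

```python
def hensel_divrem(A, B, beta, n):
    """Return (Q', R') with A = Q'*B + R'*beta^n; requires gcd(B, beta) = 1."""
    bn = beta ** n
    Qp = (A * pow(B, -1, bn)) % bn       # LSB-quotient: Q' = A/B mod beta^n
    Rp = (A - Qp * B) // bn              # exact: the low n words of A - Q'B vanish
    return Qp, Rp
```

With $\beta = 10$, $n = 2$, $A = 1234$ and $B = 37$, this gives $Q' = 82$ and $R' = -18$, and indeed $82 \cdot 37 - 18 \cdot 100 = 1234$.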

Most MSB-division variants (naive, with preconditioning, divide and
conquer, Newton's iteration) have their LSB-counterpart. For example, LSB
preconditioning involves using a multiple $kB$ of the divisor such that
$kB = 1 \bmod \beta$, and Newton's iteration is called Hensel lifting in the LSB
case. The exact division algorithm described at the end of §1.4.5 uses both
MSB- and LSB-division simultaneously. One important difference is that
LSB-division does not need any correction step, since the carries go in the
direction opposite to the cancelled bits.

When only the remainder is wanted, Hensel's division is usually known
as Montgomery reduction (see §2.4.2).



1.5 Roots 

1.5.1 Square Root 

The "paper and pencil" method once taught at school to extract square roots
is very similar to "paper and pencil" division. It decomposes an integer $m$
of the form $s^2 + r$, taking two digits of $m$ at a time, and finding one digit of
$s$ for each two digits of $m$. It is based on the following idea. If $m = s^2 + r$
is the current decomposition, then taking two more digits of the argument,
we have a decomposition of the form $100m + r' = 100s^2 + 100r + r'$ with
$0 \le r' < 100$. Since $(10s + t)^2 = 100s^2 + 20st + t^2$, a good approximation to
the next digit $t$ can be found by dividing $10r$ by $2s$.

Algorithm SqrtRem generalizes this idea to a power $\beta^{\ell}$ of the internal
base close to $m^{1/4}$: one obtains a divide and conquer algorithm, which is in
fact an error-free variant of Newton's method (cf. Chapter 4):

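Algorithm 1.12 SqrtRem

Input: $m = a_{n-1}\beta^{n-1} + \cdots + a_1\beta + a_0$ with $a_{n-1} \ne 0$
Output: $(s, r)$ such that $s^2 \le m = s^2 + r < (s + 1)^2$
Require: a base-case routine BasecaseSqrtRem
$\ell \leftarrow \lfloor (n - 1)/4 \rfloor$
if $\ell = 0$ then return BasecaseSqrtRem($m$)
write $m = a_3\beta^{3\ell} + a_2\beta^{2\ell} + a_1\beta^{\ell} + a_0$ with $0 \le a_2, a_1, a_0 < \beta^{\ell}$
$(s', r') \leftarrow$ SqrtRem($a_3\beta^{\ell} + a_2$)
$(q, u) \leftarrow$ DivRem($r'\beta^{\ell} + a_1$, $2s'$)
$s \leftarrow s'\beta^{\ell} + q$
$r \leftarrow u\beta^{\ell} + a_0 - q^2$
if $r < 0$ then
    $r \leftarrow r + 2s - 1$, $s \leftarrow s - 1$
return $(s, r)$.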



Theorem 1.5.1 Algorithm SqrtRem correctly returns the integer square
root $s$ and remainder $r$ of the input $m$, and has complexity $R(2n) \sim R(n) +
D(n) + S(n)$ where $D(n)$ and $S(n)$ are the complexities of the division with
remainder and squaring respectively. This gives $R(n) \sim n^2/2$ with naive
multiplication, $R(n) \sim 4K(n)/3$ with Karatsuba's multiplication, assuming
$S(n) \sim 2M(n)/3$.

As an example, assume Algorithm SqrtRem is called on $m = 123\,456\,789$
with $\beta = 10$. One has $n = 9$, $\ell = 2$, $a_3 = 123$, $a_2 = 45$, $a_1 = 67$, and $a_0 = 89$.
The recursive call for $a_3\beta^{\ell} + a_2 = 12\,345$ yields $s' = 111$ and $r' = 24$. The
DivRem call yields $q = 11$ and $u = 25$, which gives $s = 11\,111$ and $r = 2\,468$.
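
The recursion can be followed in the Python sketch below. It is an illustration under assumptions: the names `num_words` and `sqrt_rem` are chosen here, the base case uses `math.isqrt` (Python 3.8 or later) as a stand-in for BasecaseSqrtRem, and `divmod` stands in for DivRem.

```python
from math import isqrt

def num_words(m, beta):
    """Number of base-beta words of m (0 for m = 0)."""
    n = 0
    while m > 0:
        m //= beta
        n += 1
    return n

def sqrt_rem(m, beta=10):
    """Return (s, r) with m = s^2 + r and s^2 <= m < (s + 1)^2."""
    n = num_words(m, beta)
    l = (n - 1) // 4 if n > 0 else 0
    if l == 0:
        s = isqrt(m)                     # stand-in for BasecaseSqrtRem
        return s, m - s * s
    bl = beta ** l
    a0 = m % bl
    a1 = (m // bl) % bl
    high = m // bl ** 2                  # a3*beta^l + a2
    s1, r1 = sqrt_rem(high, beta)        # root of the upper half
    q, u = divmod(r1 * bl + a1, 2 * s1)  # DivRem step
    s = s1 * bl + q
    r = u * bl + a0 - q * q
    if r < 0:                            # single correction
        r += 2 * s - 1
        s -= 1
    return s, r
```

Calling `sqrt_rem(123456789, 10)` reproduces the values of the example: the recursive call returns (111, 24), the division gives $q = 11$ and $u = 25$, and the result is (11111, 2468).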

Another nice way to compute the integer square root of an integer $m$, i.e.,
$\lfloor m^{1/2} \rfloor$, is Algorithm SqrtInt, which is an all-integer version of Newton's
method.

Algorithm 1.13 SqrtInt

Input: an integer $m \ge 1$
Output: $s = \lfloor m^{1/2} \rfloor$
1: $u \leftarrow m$    ▷ any value $u \ge \lfloor m^{1/2} \rfloor$ works
2: repeat
3:     $s \leftarrow u$
4:     $t \leftarrow s + \lfloor m/s \rfloor$
5:     $u \leftarrow \lfloor t/2 \rfloor$
6: until $u \ge s$
7: return $s$.

Still with input 123 456 789, we successively get $s$ = 61 728 395, 30 864 198,
15 432 100, 7 716 053, 3 858 034, 1 929 032, 964 547, 482 337, 241 296, 120 903,
60 962, 31 493, 17 706, 12 339, 11 172, 11 111, 11 111. Convergence is slow
because the initial value of $u$ assigned at line 1 is much too large. However,
any initial value greater than or equal to $\lfloor m^{1/2} \rfloor$ works (see the proof of
Algorithm RootInt below): starting from $s = 12\,000$, one gets $s = 11\,144$
then $s = 11\,111$. See Exercise 1.28.

1.5.2 k-th Root 

The idea of Algorithm SqrtRem for the integer square root can be
generalized to any power: if the current decomposition is $m = m'\beta^k + m''\beta^{k-1} + m'''$,
first compute a $k$-th root of $m'$, say $m' = s^k + r$, then divide $r\beta + m''$ by $ks^{k-1}$
to get an approximation of the next root digit $t$, and correct it if needed.
Unfortunately the computation of the remainder, which is easy for the square
root, involves $O(k)$ terms for the $k$-th root, and this method may be slower
than Newton's method with floating-point arithmetic (§4.2.3).

Similarly, Algorithm SqrtInt can be generalized to the $k$-th root (see
Algorithm RootInt).

Algorithm 1.14 RootInt

Input: integers $m \ge 1$, and $k \ge 2$
Output: $s = \lfloor m^{1/k} \rfloor$
1: $u \leftarrow m$    ▷ any value $u \ge \lfloor m^{1/k} \rfloor$ works
2: repeat
3:     $s \leftarrow u$
4:     $t \leftarrow (k - 1)s + \lfloor m/s^{k-1} \rfloor$
5:     $u \leftarrow \lfloor t/k \rfloor$
6: until $u \ge s$
7: return $s$.

Theorem 1.5.2 Algorithm RootInt terminates and returns $\lfloor m^{1/k} \rfloor$.

Proof. As long as $u < s$ in step 6, the sequence of $s$-values is decreasing, thus
it suffices to consider what happens when $u \ge s$. First it is easy to see that
$u \ge s$ implies $m \ge s^k$, because $t \ge ks$ thus $(k - 1)s + m/s^{k-1} \ge ks$. Consider
now the function $f(t) := [(k - 1)t + m/t^{k-1}]/k$ for $t > 0$; its derivative is
negative for $t < m^{1/k}$, and positive for $t > m^{1/k}$, thus $f(t) \ge f(m^{1/k}) = m^{1/k}$.
This proves that $s \ge \lfloor m^{1/k} \rfloor$. Together with $s \le m^{1/k}$, this proves that
$s = \lfloor m^{1/k} \rfloor$ at the end of the algorithm. □

Note that any initial value greater than or equal to $\lfloor m^{1/k} \rfloor$ works at step 1.
Incidentally, we have proved the correctness of Algorithm SqrtInt, which is
just the special case $k = 2$ of Algorithm RootInt.
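
The iteration is short enough to transcribe directly. In the sketch below the name `root_int` and the optional starting value `u0` are assumptions made for illustration; `u0` must be at least $\lfloor m^{1/k} \rfloor$, and defaults to $m$ as in step 1.

```python
def root_int(m, k, u0=None):
    """Return floor(m^(1/k)) for integers m >= 1, k >= 2 (all-integer Newton)."""
    u = m if u0 is None else u0          # step 1: any u >= floor(m^(1/k)) works
    while True:
        s = u                            # step 3
        t = (k - 1) * s + m // s ** (k - 1)   # step 4
        u = t // k                       # step 5
        if u >= s:                       # step 6: stop when no longer decreasing
            return s
```

With $k = 2$ this is the square root iteration: `root_int(123456789, 2, 12000)` passes through 11144 and returns 11111.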

1.5.3 Exact Root 

When a k-th root is known to be exact, there is of course no need to com- 
pute exactly the final remainder in "exact root" algorithms, which saves 
some computation time. However, one has to check that the remainder is 
sufficiently small that the computed root is correct. 

When a root is known to be exact, one may also try to compute it starting
from the least significant bits, as for exact division. Indeed, if $s^k = m$, then
$s^k = m \bmod \beta^{\ell}$ for any integer $\ell$. However, in the case of exact division,
the equation $a = qb \bmod \beta^{\ell}$ has only one solution $q$ as soon as $b$ is relatively
prime to $\beta$. Here, the equation $s^k = m \bmod \beta^{\ell}$ may have several solutions,
so the lifting process is not unique. For example, $x^2 = 1 \bmod 2^3$ has four
solutions 1, 3, 5, 7.

Suppose we have = m mod and we want to lift to This implies 
(s + tp^Y = fn + m'P^ mod where <t,m' < p. Thus 

. m — 
kt = m -\ — — mod p. 

P 

This equation has a unique solution $t$ when $k$ is relatively prime to $\beta$. For
example, we can extract cube roots in this way for $\beta$ a power of two. When
$k$ is relatively prime to $\beta$, we can also compute the root simultaneously from
the most significant and least significant ends, as for exact division.
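
The least-significant-first lifting can be sketched for the simplest case $\beta = 2$, under the assumption that both $m$ and $k$ are odd. The linear term in the expansion of $(s + t2^\ell)^k$ is $k s^{k-1} t 2^\ell$, and with $k$ and $s$ odd this factor is odd, so the new bit $t$ is simply bit $\ell$ of $m - s^k$. The function name `exact_root_lsb` is hypothetical.

```python
def exact_root_lsb(m, k, nbits):
    """Return the unique s in [0, 2**nbits) with s**k == m (mod 2**nbits).

    Sketch only: assumes m and k are both odd, so that k * s**(k-1) is odd
    and each lifting step has exactly one solution.
    """
    if m % 2 == 0 or k % 2 == 0:
        raise ValueError("this sketch needs m and k odd")
    s = 1                                    # s**k == m (mod 2): both sides are odd
    for ell in range(1, nbits):
        modulus = 1 << (ell + 1)
        diff = (m - pow(s, k, modulus)) % modulus   # equals 0 or 2**ell
        t = diff >> ell                      # next bit of the root
        s += t << ell
    return s
```

If $m$ is known to be an exact $k$-th power of a root below $2^{\text{nbits}}$, the returned value is that root; otherwise it is only a root modulo $2^{\text{nbits}}$ and must be checked.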
Unknown Exponent

Assume now that one wants to check if a given integer $m$ is an exact power,
without knowing the corresponding exponent. For example, some primality
testing or factorization algorithms fail when given an exact power, so this has
to be checked first. Algorithm IsPower detects exact powers, and returns
the largest corresponding exponent (or 1 if the input is not an exact power).



Algorithm 1.15 IsPower
Input: a positive integer $m$
Output: $k \ge 2$ when $m$ is an exact $k$-th power, 1 otherwise
1: for $k$ from $\lfloor \lg m \rfloor$ downto 2 do
2:     if $m$ is a $k$-th power then return $k$
3: return 1.



To quickly detect non-$k$-th powers at step 2 one may use modular algorithms
when $k$ is relatively prime to the base $\beta$ (see above).
Remark: in Algorithm IsPower, one can limit the search to prime exponents
$k$, but then the algorithm does not necessarily return the largest exponent,
and we might have to call it again. For example, taking $m = 117649$,
the modified algorithm first returns 3 because $117649 = 49^3$, and when called
again with $m = 49$ it returns 2.
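
A minimal Python sketch of Algorithm IsPower follows. The $k$-th power test is done here by computing the integer $k$-th root with the Newton iteration of Algorithm RootInt and raising it back to the $k$-th power; this is only one possible test (the modular pre-filter mentioned above is not included), and the helper name `kth_root_floor` is our own.

```python
def kth_root_floor(m, k):
    """floor(m**(1/k)) by the integer Newton iteration of Algorithm RootInt."""
    u = m
    while True:
        s = u
        u = ((k - 1) * s + m // s ** (k - 1)) // k
        if u >= s:
            return s

def is_power(m):
    """Largest k >= 2 such that m is an exact k-th power, or 1 if there is none."""
    if m < 1:
        raise ValueError("need a positive integer")
    for k in range(m.bit_length() - 1, 1, -1):    # k from floor(lg m) down to 2
        if kth_root_floor(m, k) ** k == m:
            return k
    return 1
```

With the full exponent range, `is_power(117649)` reports the exponent 6 directly, whereas the prime-exponent variant of the remark needs two calls.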

1.6 Greatest Common Divisor 

Many algorithms for computing gcds may be found in the literature. We can 
distinguish between the following (non-exclusive) types: 

• left-to-right (MSB) versus right-to-left (LSB) algorithms: in the former 
the actions depend on the most significant bits, while in the latter the 
actions depend on the least significant bits; 

• naive algorithms: these $O(n^2)$ algorithms consider one word of each
operand at a time, trying to guess from them the first quotients; we
count in this class algorithms considering double-size words, namely
Lehmer's algorithm and Sorenson's $k$-ary reduction in the left-to-right
and right-to-left cases respectively; algorithms not in this class consider
a number of words that depends on the input size $n$, and are often
subquadratic;

• subtraction-only algorithms: these algorithms trade divisions for sub- 
tractions, at the cost of more iterations; 

• plain versus extended algorithms: the former just compute the gcd of 
the inputs, while the latter express the gcd as a linear combination of 
the inputs. 



1.6.1 Naive GCD 

For completeness we mention Euclid's algorithm for finding the gcd of two 
non-negative integers u, v. 



Algorithm 1.16 EuclidGcd
Input: $u, v$ nonnegative integers (not both zero)
Output: $\gcd(u, v)$
while $v \ne 0$ do
    $(u, v) \leftarrow (v, u \bmod v)$
return $u$.



Euclid's algorithm is discussed in many textbooks, and we do not recommend
it in its simplest form, except for testing purposes. Indeed, it is usually
a slow way to compute a gcd. However, Euclid's algorithm does show the
connection between gcds and continued fractions. If $u/v$ has a regular
continued fraction of the form

$$u/v = q_0 + \cfrac{1}{q_1 + \cfrac{1}{q_2 + \cfrac{1}{q_3 + \cdots}}}$$

then the quotients $q_0, q_1, \ldots$ are precisely the quotients $u$ div $v$ of the divisions
performed in Euclid's algorithm. For more on continued fractions, see §4.6.
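
The link between Euclid's algorithm and continued fractions can be illustrated by a short Python sketch that records the quotients and then rebuilds the fraction from them. Both function names are our own.

```python
def euclid_gcd_with_quotients(u, v):
    """Return (gcd(u, v), [q0, q1, ...]) for non-negative u, v, not both zero."""
    quotients = []
    while v != 0:
        q, r = divmod(u, v)
        quotients.append(q)
        u, v = v, r
    return u, quotients

def evaluate_continued_fraction(quotients):
    """Rebuild (numerator, denominator), in lowest terms, from partial quotients."""
    num, den = 1, 0                       # stands for "infinity" below the last quotient
    for q in reversed(quotients):
        num, den = q * num + den, num     # x -> q + 1/x
    return num, den
```

Rebuilding the fraction from the quotients of $u/v$ gives $u/\gcd(u,v)$ over $v/\gcd(u,v)$, since a regular continued fraction always evaluates to a fraction in lowest terms.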



Double-Digit Gcd. A first improvement comes from Lehmer's observation:
the first few quotients in Euclid's algorithm usually can be determined
from the most significant words of the inputs. This avoids expensive divisions
that give small quotients most of the time (see [143, §4.5.3]). Consider
for example $a$ = 427 419 669 081 and $b$ = 321 110 693 270 with 3-digit words.
The first quotients are 1, 3, 48, .... Now if we consider the most significant
words, namely 427 and 321, we get the quotients 1, 3, 35, .... If we stop after
the first two quotients, we see that we can replace the initial inputs by $a - b$
and $-3a + 4b$, which gives 106 308 975 811 and 2 183 765 837.

Lehmer's algorithm determines cofactors from the most significant words
of the input integers. Those cofactors usually have size only half a word.
The DoubleDigitGcd algorithm (which should be called "double-word")
uses the two most significant words instead, which gives cofactors $t, u, v, w$
of one full word each, such that $\gcd(a, b) = \gcd(ta + ub, va + wb)$. This is
optimal for the computation of the four products $ta, ub, va, wb$. With the
above example, if we consider 427 419 and 321 110, we find that the first five
quotients agree, so we can replace $a, b$ by $-148a + 197b$ and $441a - 587b$, i.e.,
695 550 202 and 97 115 231.
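
The numbers in this example can be checked with a small Python sketch. It takes the quotients computed from the leading digits only, accumulates the corresponding cofactors, and applies them to the full inputs. The helper names are hypothetical, and no attempt is made here to detect when a leading-digit quotient stops being reliable; the cut-off points (two and five quotients) are taken from the text.

```python
def euclid_quotients(u, v):
    """Quotients of the divisions performed by Euclid's algorithm on (u, v)."""
    qs = []
    while v != 0:
        qs.append(u // v)
        u, v = v, u % v
    return qs

def cofactor_matrix(quotients):
    """Return ((t, u), (v, w)) such that, after the given quotients, the two
    current remainders of Euclid's algorithm on (a, b) are t*a + u*b and v*a + w*b."""
    row0, row1 = (1, 0), (0, 1)
    for q in quotients:
        row0, row1 = row1, (row0[0] - q * row1[0], row0[1] - q * row1[1])
    return row0, row1

a, b = 427419669081, 321110693270

# One leading 3-digit word of each input: keep the first two quotients.
(t, u), (v, w) = cofactor_matrix(euclid_quotients(427, 321)[:2])
print(t * a + u * b, v * a + w * b)

# Two leading words of each input: keep the first five quotients.
(t, u), (v, w) = cofactor_matrix(euclid_quotients(427419, 321110)[:5])
print(t * a + u * b, v * a + w * b)
```

The two printed pairs are the reduced inputs quoted in the text for the single-word and double-word cases respectively.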

Algorithm 1.17 DoubleDigitGcd
Input: $a := a_{n-1}\beta^{n-1} + \cdots + a_0$, $b := b_{m-1}\beta^{m-1} + \cdots + b_0$
Output: $\gcd(a, b)$
if $b = 0$ then return $a$
if $m < 2$ then return BasecaseGcd$(a, b)$
if $a < b$ or $n > m$ then return DoubleDigitGcd$(b, a \bmod b)$
$(t, u, v, w) \leftarrow$ HalfBezout$(a_{n-1}\beta + a_{n-2}, b_{n-1}\beta + b_{n-2})$
return DoubleDigitGcd$(|ta + ub|, |va + wb|)$.



The subroutine HalfBezout takes as input two 2-word integers, performs
Euclid's algorithm until the smallest remainder fits in one word, and returns
the corresponding matrix $[t, u; v, w]$.

Binary Gcd. A better algorithm than Euclid's, though also of $O(n^2)$ complexity,
is the binary algorithm. It differs from Euclid's algorithm in two
ways: it considers least significant bits first, and it avoids divisions, except for
divisions by two (which can be implemented as shifts on a binary computer).
See Algorithm BinaryGcd. Note that the first three "while" loops can be
omitted if the inputs $a$ and $b$ are odd.
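
A direct Python transcription of the binary algorithm might look as follows. It is a sketch for illustration: the common power of two is kept as a shift count `t` rather than as the factor $2^t$, which is our own small variation.

```python
def binary_gcd(a, b):
    """gcd of two positive integers using only subtractions and shifts."""
    if a <= 0 or b <= 0:
        raise ValueError("need a, b > 0")
    t = 0                                 # number of common factors of two
    while a % 2 == 0 and b % 2 == 0:
        a, b, t = a >> 1, b >> 1, t + 1
    while a % 2 == 0:
        a >>= 1
    while b % 2 == 0:
        b >>= 1
    # now a and b are both odd
    while a != b:
        a, b = abs(a - b), min(a, b)      # a - b is even and nonzero
        while a % 2 == 0:                 # divide a by 2**nu(a)
            a >>= 1
    return a << t
```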

Sorenson's $k$-ary reduction

The binary algorithm is based on the fact that if $a$ and $b$ are both odd,
then $a - b$ is even, and we can remove a factor of two since $\gcd(a, b)$ is odd.
Algorithm 1.18 BinaryGcd
Input: $a, b > 0$
Output: $\gcd(a, b)$
$t \leftarrow 1$
while $a \bmod 2 = b \bmod 2 = 0$ do
    $(t, a, b) \leftarrow (2t, a/2, b/2)$
while $a \bmod 2 = 0$ do
    $a \leftarrow a/2$
while $b \bmod 2 = 0$ do
    $b \leftarrow b/2$    ▷ now $a$ and $b$ are both odd
while $a \ne b$ do
    $(a, b) \leftarrow (|a - b|, \min(a, b))$
    $a \leftarrow a/2^{\nu(a)}$    ▷ $\nu(a)$ is the 2-valuation of $a$
return $ta$.



Sorenson's $k$-ary reduction is a generalization of that idea: given $a$ and $b$
odd, we try to find small integers $u, v$ such that $ua - vb$ is divisible by a large
power of two.

Theorem 1.6.1 [227] If $a, b > 0$, $m > 1$ with $\gcd(a, m) = \gcd(b, m) = 1$,
there exist $u, v$, $0 < |u|, v < \sqrt{m}$ such that $ua = vb \bmod m$.

Algorithm ReducedRatMod finds such a pair $(u, v)$; it is a simple variation
of the extended Euclidean algorithm; indeed, the $u_i$ are quotients in the
continued fraction expansion of $c/m$.

When $m$ is a prime power, the inversion $1/b \bmod m$ at step 1 of Algorithm
ReducedRatMod can be performed efficiently using Hensel lifting (§2.5).

Given two integers $a, b$ of say $n$ words, Algorithm ReducedRatMod
with $m = \beta^2$ returns two integers $u, v$ such that $vb - ua$ is a multiple of $\beta^2$.
Since $u, v$ have at most one word each, $a' = (vb - ua)/\beta^2$ has at most $n - 1$
words (plus possibly one bit), therefore with $b' = b \bmod a'$ we obtain
$\gcd(a, b) = \gcd(a', b')$, where both $a'$ and $b'$ have about one word less than
$\max(a, b)$. This gives an LSB variant of the double-digit (MSB) algorithm.
Algorithm 1.19 ReducedRatMod
Input: $a, b > 0$, $m > 1$ with $\gcd(a, m) = \gcd(b, m) = 1$
Output: $(u, v)$ such that $0 < |u|, v < \sqrt{m}$ and $ua = vb \bmod m$
1: $c \leftarrow a/b \bmod m$
2: $(u_1, v_1) \leftarrow (0, m)$
3: $(u_2, v_2) \leftarrow (1, c)$
4: while $v_2 \ge \sqrt{m}$ do
5:     $q \leftarrow \lfloor v_1/v_2 \rfloor$
6:     $(u_1, u_2) \leftarrow (u_2, u_1 - qu_2)$
7:     $(v_1, v_2) \leftarrow (v_2, v_1 - qv_2)$
8: return $(u_2, v_2)$.
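
Algorithm ReducedRatMod translates into a few lines of Python. This sketch uses the built-in `pow(b, -1, m)` (available from Python 3.8) for the modular inverse rather than Hensel lifting, and compares $v_2^2$ with $m$ to avoid computing a square root; both are our own implementation choices.

```python
def reduced_rat_mod(a, b, m):
    """Return (u, v) with u*a == v*b (mod m), 0 < |u| and 0 < v, both below sqrt(m).

    Assumes a, b > 0, m > 1 and gcd(a, m) = gcd(b, m) = 1.
    """
    c = a * pow(b, -1, m) % m             # c = a/b mod m
    u1, v1 = 0, m
    u2, v2 = 1, c
    while v2 * v2 >= m:                   # i.e. v2 >= sqrt(m), in exact arithmetic
        q = v1 // v2
        u1, u2 = u2, u1 - q * u2
        v1, v2 = v2, v1 - q * v2
    return u2, v2
```

Throughout the loop the pairs satisfy $u_i c = v_i \bmod m$, which is why the returned pair satisfies $ua = vb \bmod m$.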



1.6.2 Extended GCD 

Algorithm ExtendedGcd solves the extended greatest common divisor problem:
given two integers $a$ and $b$, it computes their gcd $g$, and also two integers
$u$ and $v$ (called Bézout coefficients or sometimes cofactors or multipliers) such
that $g = ua + vb$.

Algorithm 1.20 ExtendedGcd
Input: positive integers $a$ and $b$
Output: integers $(g, u, v)$ such that $g = \gcd(a, b) = ua + vb$
1: $(u, w) \leftarrow (1, 0)$
2: $(v, x) \leftarrow (0, 1)$
3: while $b \ne 0$ do
4:     $(q, r) \leftarrow$ DivRem$(a, b)$
5:     $(a, b) \leftarrow (b, r)$
6:     $(u, w) \leftarrow (w, u - qw)$
7:     $(v, x) \leftarrow (x, v - qx)$
8: return $(a, u, v)$.



If $a_0$ and $b_0$ are the input numbers, and $a, b$ the current values, the following
invariants hold at the start of each iteration of the while loop and after
the while loop: $a = ua_0 + vb_0$, and $b = wa_0 + xb_0$. (See Exercise 1.30 for a
bound on the cofactor $u$.)

An important special case is modular inversion (see Chapter 2): given an
integer $n$, one wants to compute $1/a \bmod n$ for $a$ relatively prime to $n$. One
then simply runs Algorithm ExtendedGcd with input $a$ and $b = n$: this
yields $u$ and $v$ with $ua + vn = 1$, thus $1/a = u \bmod n$. Since $v$ is not needed
here, we can simply avoid computing $v$ and $x$, by removing steps 2 and 7.

It may also be worthwhile to compute only $u$ in the general case, as the
cofactor $v$ can be recovered from $v = (g - ua)/b$, this division being exact
(see §1.4.5).
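
The following Python sketch shows Algorithm ExtendedGcd together with the modular-inversion variant that tracks only the cofactor $u$. The function names and the error handling for a non-invertible input are our own additions.

```python
def extended_gcd(a, b):
    """Return (g, u, v) with g = gcd(a, b) = u*a + v*b (Algorithm ExtendedGcd)."""
    u, w = 1, 0
    v, x = 0, 1
    while b != 0:
        q, r = divmod(a, b)
        a, b = b, r
        u, w = w, u - q * w
        v, x = x, v - q * x
    return a, u, v

def mod_inverse(a, n):
    """Return 1/a mod n, computing only the cofactor u (steps 2 and 7 dropped)."""
    r0, r1 = a % n, n                     # invariant: r0 == u*a and r1 == w*a (mod n)
    u, w = 1, 0
    while r1 != 0:
        q = r0 // r1
        r0, r1 = r1, r0 - q * r1
        u, w = w, u - q * w
    if r0 != 1:
        raise ValueError("a is not invertible modulo n")
    return u % n
```

When only $u$ is computed in the general case, the other cofactor can be recovered afterwards as `(g - u * a) // b`, the division being exact.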

All known algorithms for subquadratic gcd rely on an extended gcd sub- 
routine which is called recursively, so we discuss the subquadratic extended 
gcd in the next section. 

1.6.3 Half Binary GCD, Divide and Conquer GCD 

Designing a subquadratic integer gcd algorithm that is both mathematically 
correct and efficient in practice is a challenging problem. 

A first remark is that, starting from $n$-bit inputs, there are $O(n)$ terms in
the remainder sequence $r_0 = a, r_1 = b, \ldots, r_{i+1} = r_{i-1} \bmod r_i, \ldots$, and the
size of $r_i$ decreases linearly with $i$. Thus, computing all the partial remainders
$r_i$ leads to a quadratic cost, and a fast algorithm should avoid this.

However, the partial quotients $q_i = r_{i-1}$ div $r_i$ are usually small: the main
idea is thus to compute them without computing the partial remainders. This
can be seen as a generalization of the DoubleDigitGcd algorithm: instead
of considering a fixed base $\beta$, adjust it so that the inputs have four "big
words". The cofactor-matrix returned by the HalfBezout subroutine will
then reduce the input size to about $3n/4$. A second call with the remaining
two most significant "big words" of the new remainders will reduce their size
to half the input size. See Exercise 1.31.

The same method applies in the LSB case, and is in fact simpler to turn
into a correct algorithm. In this case, the terms $r_i$ form a binary remainder
sequence, which corresponds to the iteration of the BinaryDivide algorithm,
with starting values $a, b$.

The integer $q$ is the binary quotient of $a$ and $b$, and $r$ is the binary remainder.

This right-to-left division defines a right-to-left remainder sequence $a_0 = a$,
$a_1 = b$, ..., where $a_{i+1} = \mathrm{BinaryRemainder}(a_{i-1}, a_i)$, and $\nu(a_i) < \nu(a_{i+1})$.
It can be shown that this sequence eventually reaches $a_{i+1} = 0$ for
some index $i$. Assuming $\nu(a) = 0$, then $\gcd(a, b)$ is the odd part of $a_i$.
Indeed, in Algorithm BinaryDivide, if some odd prime divides both $a$ and
$b$, it certainly divides $2^{-j}b$ which is an integer, and thus it divides $a + q2^{-j}b$.
Algorithm 1.21 BinaryDivide
Input: $a, b \in \mathbb{Z}$ with $\nu(b) - \nu(a) = j > 0$
Output: $|q| < 2^j$ and $r = a + q2^{-j}b$ such that $\nu(b) < \nu(r)$
$b' \leftarrow 2^{-j}b$
$q \leftarrow -a/b' \bmod 2^{j+1}$
if $q \ge 2^j$ then $q \leftarrow q - 2^{j+1}$
return $q$, $r = a + q2^{-j}b$.



Conversely, if some odd prime divides both $b$ and $r$, it divides also $2^{-j}b$, thus
it divides $a = r - q2^{-j}b$; this shows that no spurious factor appears, unlike
in some other gcd algorithms.

Example: let $a = a_0 = 935$ and $b = a_1 = 714$, so $\nu(b) = \nu(a) + 1$. Algorithm
BinaryDivide computes $b' = 357$, $q = 1$, and $a_2 = a + q2^{-j}b = 1292$. The
next step gives $a_3 = 1360$, then $a_4 = 1632$, $a_5 = 2176$, $a_6 = 0$. Since $2176 =
2^7 \cdot 17$, we conclude that the gcd of 935 and 714 is 17. Note that the binary
remainder sequence might contain negative terms and terms larger than $a, b$.
For example, starting from $a = 19$ and $b = 2$, we get 19, 2, 20, −8, 16, 0.
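
The binary division and the resulting remainder sequence can be reproduced with the Python sketch below. When $\nu(a) > 0$, the quotient $a/b'$ is interpreted here as the ratio of the odd parts of $a$ and $b$, which is an assumption of this sketch (it is the only reading under which the division modulo a power of two is defined). The helper names are our own.

```python
def val2(x):
    """2-adic valuation nu(x) of a nonzero integer x."""
    return (x & -x).bit_length() - 1

def binary_divide(a, b):
    """Return (q, r) with r = a + q*b/2**j, |q| < 2**j and nu(r) > nu(b),
    where j = nu(b) - nu(a) > 0 (Algorithm BinaryDivide)."""
    va, vb = val2(a), val2(b)
    j = vb - va
    if j <= 0:
        raise ValueError("need nu(b) > nu(a)")
    mod = 1 << (j + 1)
    a_odd, b_odd = a >> va, b >> vb            # exact divisions by powers of two
    q = (-a_odd * pow(b_odd % mod, -1, mod)) % mod   # q = -a/b' mod 2**(j+1)
    if q >= 1 << j:
        q -= mod                               # centre q in (-2**j, 2**j)
    return q, a + q * (b >> j)

def binary_remainder_sequence(a, b):
    """Iterate binary_divide until a zero term appears."""
    seq = [a, b]
    while seq[-1] != 0:
        seq.append(binary_divide(seq[-2], seq[-1])[1])
    return seq
```

For the inputs 935 and 714 the last nonzero term is 2176, whose odd part 17 is the gcd.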

An asymptotically fast GCD algorithm with complexity $O(M(n) \log n)$
can be constructed with Algorithm HalfBinaryGcd.

Theorem 1.6.2 Given $a, b \in \mathbb{Z}$ with $\nu(a) = 0$ and $\nu(b) > 0$, and an integer
$k \ge 0$, Algorithm HalfBinaryGcd returns an integer $0 \le j \le k$ and a
matrix $R$ such that, if $c = 2^{-2j}(R_{1,1}a + R_{1,2}b)$ and $d = 2^{-2j}(R_{2,1}a + R_{2,2}b)$:

1. $c$ and $d$ are integers with $\nu(c) = 0$ and $\nu(d) > 0$;

2. $c^* = 2^j c$ and $d^* = 2^j d$ are two consecutive terms from the binary
remainder sequence of $a, b$ with $\nu(c^*) \le k < \nu(d^*)$.

Proof. We prove the theorem by induction on $k$. If $k = 0$, the algorithm
returns $j = 0$ and the identity matrix, thus we have $c = a$ and $d = b$, and
the statement is true. Now suppose $k > 0$, and assume that the theorem is
true up to $k - 1$.

The first recursive call uses $k_1 < k$, since $k_1 = \lfloor k/2 \rfloor < k$. After step 5,
by induction $a'_1 = 2^{-2j_1}(R_{1,1}a_1 + R_{1,2}b_1)$ and $b'_1 = 2^{-2j_1}(R_{2,1}a_1 + R_{2,2}b_1)$ are
integers with $\nu(a'_1) = 0 < \nu(b'_1)$, and $2^{j_1}a'_1, 2^{j_1}b'_1$ are two consecutive terms
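Algorithm 1.22 HalfBinaryGcd
Input: $a, b \in \mathbb{Z}$ with $0 = \nu(a) < \nu(b)$, a non-negative integer $k$
Output: an integer $j$ and a $2 \times 2$ matrix $R$ satisfying Theorem 1.6.2
1: if $\nu(b) > k$ then
2:     return $0$, $\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$
3: $k_1 \leftarrow \lfloor k/2 \rfloor$
4: $a_1 \leftarrow a \bmod 2^{2k_1+1}$, $b_1 \leftarrow b \bmod 2^{2k_1+1}$
5: $j_1, R \leftarrow$ HalfBinaryGcd$(a_1, b_1, k_1)$
6: $a' \leftarrow 2^{-2j_1}(R_{1,1}a + R_{1,2}b)$, $b' \leftarrow 2^{-2j_1}(R_{2,1}a + R_{2,2}b)$
7: $j_0 \leftarrow \nu(b')$
8: if $j_0 + j_1 > k$ then
9:     return $j_1, R$
10: $q, r \leftarrow$ BinaryDivide$(a', b')$
11: $k_2 \leftarrow k - (j_0 + j_1)$
12: $a_2 \leftarrow b'/2^{j_0} \bmod 2^{2k_2+1}$, $b_2 \leftarrow r/2^{j_0} \bmod 2^{2k_2+1}$
13: $j_2, S \leftarrow$ HalfBinaryGcd$(a_2, b_2, k_2)$
14: return $j_1 + j_0 + j_2$, $S \times \begin{pmatrix} 0 & 2^{j_0} \\ 2^{j_0} & q \end{pmatrix} \times R$.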
from the binary remainder sequence of $a_1, b_1$. Lemma 7 of [209] says that the
quotients of the remainder sequence of $a, b$ coincide with those of $a_1, b_1$ up
to $2^{j_1}a'$ and $2^{j_1}b'$. This proves that $2^{j_1}a', 2^{j_1}b'$ are two consecutive terms of
the remainder sequence of $a, b$. Since $a$ and $a_1$ differ by a multiple of $2^{2k_1+1}$,
$a'$ and $a'_1$ differ by a multiple of $2^{2k_1+1-2j_1} \ge 2$ since $j_1 \le k_1$ by induction.
It follows that $\nu(a') = 0$. Similarly, $b'$ and $b'_1$ differ by a multiple of 2, thus
$j_0 = \nu(b') > 0$.

The second recursive call uses $k_2 < k$, since by induction $j_1 \ge 0$ and we
just showed $j_0 > 0$. It easily follows that $j_1 + j_0 + j_2 > 0$, and thus $j \ge 0$.
If we exit at step 9, we have $j = j_1 \le k_1 < k$. Otherwise $j = j_1 + j_0 + j_2 =
k - k_2 + j_2 \le k$ by induction.

If $j_0 + j_1 > k$, we have $\nu(2^{j_1}b') = j_0 + j_1 > k$, we exit the algorithm and
the statement holds. Now assume $j_0 + j_1 \le k$. We compute an extra term
$r$ of the remainder sequence from $a', b'$, which up to multiplication by $2^{j_1}$, is
an extra term of the remainder sequence of $a, b$. Since $r = a' + q2^{-j_0}b'$, we
have

$$\begin{pmatrix} b' \\ r \end{pmatrix} = 2^{-j_0} \begin{pmatrix} 0 & 2^{j_0} \\ 2^{j_0} & q \end{pmatrix} \begin{pmatrix} a' \\ b' \end{pmatrix}.$$

The new terms of the remainder sequence are $b'/2^{j_0}$ and $r/2^{j_0}$, adjusted
so that $\nu(b'/2^{j_0}) = 0$. The same argument as above holds for the second
recursive call, which stops when the 2-valuation of the sequence starting from
$a_2, b_2$ exceeds $k_2$; this corresponds to a 2-valuation larger than $j_0 + j_1 + k_2 = k$
for the $a, b$ remainder sequence. □

Given two n-bit integers a and b, and k = n/2, HalfBinaryGcd yields 
two consecutive elements c*, d* of their binary remainder sequence with bit- 
size about n/2 (for their odd part). 

Example: let $a$ = 1 889 826 700 059 and $b$ = 421 872 857 844, with $k = 20$.
The first recursive call with $a_1$ = 1 243 931, $b_1$ = 1 372 916, $k_1 = 10$ gives
$j_1 = 8$ and $R = \begin{pmatrix} 352 & 280 \\ 260 & 393 \end{pmatrix}$, which corresponds to $a'$ = 11 952 871 683 and
$b'$ = 10 027 328 112, with $j_0 = 4$. The binary division yields the new term
$r$ = 8 819 331 648, and we have $k_2 = 8$, $a_2$ = 52 775, $b_2$ = 50 468. The second
recursive call gives $j_2 = 8$ and $S = \begin{pmatrix} 64 & 272 \\ 212 & -123 \end{pmatrix}$, which finally gives $j = 20$
and the matrix $\begin{pmatrix} 1\,444\,544 & 1\,086\,512 \\ 349\,084 & 1\,023\,711 \end{pmatrix}$, which corresponds to the remainder terms
$r_8 = 2\,899\,749 \cdot 2^j$, $r_9 = 992\,790 \cdot 2^j$. With the same $a, b$ values, but with
$k = 41$, which corresponds to the bit-size of $a$, we get as final values of the
algorithm $r_{15} = 3 \cdot 2^{41}$ and $r_{16} = 0$, which proves that $\gcd(a, b) = 3$.
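
To experiment with this example, here is a compact Python sketch of the recursion described by Theorem 1.6.2 and its proof. It is written for clarity, not speed: all products use Python's built-in multiplication, so it does not achieve the $O(M(n) \log n)$ bound. The helper names, the use of `math.inf` for $\nu(0)$, and the plain $2 \times 2$ matrix product are our own choices.

```python
import math

def val2(x):
    """2-adic valuation; nu(0) is taken to be +infinity."""
    return math.inf if x == 0 else (x & -x).bit_length() - 1

def binary_divide(a, b):
    """Binary quotient and remainder: r = a + q*b/2**j with j = nu(b) - nu(a) > 0."""
    va, vb = val2(a), val2(b)
    j = vb - va
    mod = 1 << (j + 1)
    q = (-(a >> va) * pow((b >> vb) % mod, -1, mod)) % mod
    if q >= 1 << j:
        q -= mod
    return q, a + q * (b >> j)

def mat_mul(X, Y):
    """Product of two 2x2 integer matrices given as nested lists."""
    return [[sum(X[i][t] * Y[t][j] for t in range(2)) for j in range(2)]
            for i in range(2)]

def half_binary_gcd(a, b, k):
    """Return (j, R) as in Theorem 1.6.2, assuming nu(a) = 0 < nu(b) and k >= 0."""
    if val2(b) > k:
        return 0, [[1, 0], [0, 1]]
    k1 = k // 2
    m1 = 1 << (2 * k1 + 1)
    j1, R = half_binary_gcd(a % m1, b % m1, k1)
    a1 = (R[0][0] * a + R[0][1] * b) >> (2 * j1)    # a' in the text
    b1 = (R[1][0] * a + R[1][1] * b) >> (2 * j1)    # b' in the text
    j0 = val2(b1)
    if j0 + j1 > k:
        return j1, R
    q, r = binary_divide(a1, b1)
    k2 = k - (j0 + j1)
    m2 = 1 << (2 * k2 + 1)
    j2, S = half_binary_gcd((b1 >> j0) % m2, (r >> j0) % m2, k2)
    Q = [[0, 1 << j0], [1 << j0, q]]
    return j1 + j0 + j2, mat_mul(S, mat_mul(Q, R))
```

Calling `half_binary_gcd(a, b, k)` and forming $c = 2^{-2j}(R_{1,1}a + R_{1,2}b)$, $d = 2^{-2j}(R_{2,1}a + R_{2,2}b)$ gives the two odd-part-normalized terms of the theorem; with $k$ equal to the bit-size of $a$ the second one vanishes and $|c|$ is the gcd.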
Let $H(n)$ be the complexity of HalfBinaryGcd for inputs of $n$ bits and
$k = n/2$; $a_1$ and $b_1$ have $\sim n/2$ bits, the coefficients of $R$ have $\sim n/4$ bits, and
$a', b'$ have $\sim 3n/4$ bits. The remainders $a_2, b_2$ have $\sim n/2$ bits, the coefficients
of $S$ have $\sim n/4$ bits, and the final values $c, d$ have $\sim n/2$ bits. The main
costs are the matrix-vector product at step 6, and the final matrix-matrix
product. We obtain $H(n) \sim 2H(n/2) + 4M(n/4, n) + 7M(n/4)$, assuming we
use Strassen's algorithm to multiply two $2 \times 2$ matrices with 7 scalar products,
i.e., $H(n) \sim 2H(n/2) + 17M(n/4)$, assuming that we compute each $M(n/4, n)$
product with a single FFT transform of width $5n/4$, which gives cost about
$M(5n/8) \sim 0.625M(n)$ in the FFT range. Thus $H(n) = O(M(n) \log n)$.

For the plain gcd, we call HalfBinaryGcd with $k = n$, and instead
of computing the final matrix product, we multiply $2^{-2j_2}S$ by $(b', r)$ (the
components have $\sim n/2$ bits) to obtain the final $c, d$ values. The first
recursive call has $a_1, b_1$ of size $n$ with $k_1 \approx n/2$, and corresponds to $H(n)$;
the matrix $R$ and $a', b'$ have $n/2$ bits, and $k_2 \approx n/2$, thus the second recursive
call corresponds to a plain gcd of size $n/2$. The cost $G(n)$ satisfies $G(n) =
H(n) + G(n/2) + 4M(n/2, n) + 4M(n/2) \sim H(n) + G(n/2) + 10M(n/2)$.
Thus $G(n) = O(M(n) \log n)$.

An application of the half gcd per se in the MSB case is the rational
reconstruction problem. Assume one wants to compute a rational $p/q$ where
$p$ and $q$ are known to be bounded by some constant $c$. Instead of computing
with rationals, one may perform all computations modulo some integer
$n > c^2$. Hence one will end up with $p/q = m \bmod n$, and the problem is
now to find the unknown $p$ and $q$ from the known integer $m$. To do this, one
starts an extended gcd from $m$ and $n$, and one stops as soon as the current
$a$ and $u$ values (as in ExtendedGcd) are smaller than $c$: since we have
$a = um + vn$, this gives $m = a/u \bmod n$. This is exactly what is called a
half-gcd; a subquadratic version in the LSB case is given above.
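
A quadratic (plain Euclidean) version of rational reconstruction can be sketched in Python as follows. The function name is hypothetical, and the sketch makes two assumptions of its own: the unknowns satisfy $0 < |p|, q < c$, and the modulus is chosen with the comfortable margin $n > 2c^2$ so that the answer is unambiguous.

```python
def rational_reconstruct(m, n, c):
    """Return (p, q) with p/q == m (mod n), |p| < c and 0 < q.

    Sketch: runs the extended Euclidean algorithm on (n, m), tracking only
    the cofactor of m, and stops at the first remainder below c.
    """
    a0, u0 = n, 0                         # invariant: a_i == u_i * m (mod n)
    a1, u1 = m % n, 1
    while a1 >= c:
        q = a0 // a1
        a0, a1 = a1, a0 - q * a1
        u0, u1 = u1, u0 - q * u1
    if u1 < 0:                            # normalize the sign of the denominator
        a1, u1 = -a1, -u1
    return a1, u1
```

For instance, with $n = 211$ and $c = 10$, the residue of $3/7$ modulo $n$ is mapped back to the pair $(3, 7)$.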

1.7 Base Conversion 

Since computers usually work with binary numbers, and humans prefer decimal
representations, input/output base conversions are needed. In a typical
computation, there are only a few conversions, compared to the total number
of operations, so optimizing conversions is less important than optimizing
other aspects of the computation. However, when working with huge numbers,
naive conversion algorithms may slow down the whole computation.

In this section we consider that numbers are represented internally in base
$\beta$ (usually a power of 2) and externally in base $B$ (say a power of 10).
When both bases are commensurable, i.e., both are powers of a common integer,
like $\beta = 8$ and $B = 16$, conversions of $n$-digit numbers can be performed
in $O(n)$ operations. We assume here that $\beta$ and $B$ are not commensurable.

One might think that only one algorithm is needed, since input and output
are symmetric by exchanging bases $\beta$ and $B$. Unfortunately, this is not true,
since computations are done only in base $\beta$ (see Exercise 1.37).

1.7.1 Quadratic Algorithms 

Algorithms IntegerInput and IntegerOutput respectively read and write
$n$-word integers, both with a complexity of $O(n^2)$.

Algorithm 1.23 IntegerInput
Input: a string $S = s_{m-1} \ldots s_1 s_0$ of digits in base $B$
Output: the value $A$ in base $\beta$ of the integer represented by $S$
$A \leftarrow 0$
for $i$ from $m - 1$ downto 0 do
    $A \leftarrow BA + \mathrm{val}(s_i)$    ▷ $\mathrm{val}(s_i)$ is the value of $s_i$ in base $\beta$
return $A$.



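Algorithm 1.24 IntegerOutput
Input: $A = \sum_0^{n-1} a_i\beta^i > 0$
Output: a string $S$ of characters, representing $A$ in base $B$
$m \leftarrow 0$
while $A \ne 0$ do
    $s_m \leftarrow \mathrm{char}(A \bmod B)$    ▷ $s_m$: character corresponding to $A \bmod B$
    $A \leftarrow A$ div $B$
    $m \leftarrow m + 1$
return $S = s_{m-1} \ldots s_1 s_0$.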
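
Both quadratic algorithms are short enough to sketch in Python. Python integers play the role of the internal base-$\beta$ representation; the digit alphabet `DIGITS` (bases up to 36) and the handling of $A = 0$ are our own conventions.

```python
DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"

def integer_input(s, B=10):
    """Horner's rule, most significant digit first (Algorithm IntegerInput)."""
    A = 0
    for ch in s:                          # s[0] is the most significant digit
        A = B * A + DIGITS.index(ch)
    return A

def integer_output(A, B=10):
    """Repeated division by B, least significant digit first (Algorithm IntegerOutput)."""
    if A == 0:
        return "0"                        # the algorithm in the text assumes A > 0
    digits = []
    while A != 0:
        A, d = divmod(A, B)
        digits.append(DIGITS[d])
    return "".join(reversed(digits))
```

Each loop iteration multiplies or divides an integer of up to $n$ words by a single-word base, which is where the $O(n^2)$ total comes from.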



1.7.2 Subquadratic Algorithms 

Fast conversion routines are obtained using a "divide and conquer" strategy.
Given two strings $s$ and $t$, we let $s \| t$ denote the concatenation of $s$ and $t$.
For integer input, if the given string decomposes as $S = S_{hi} \| S_{lo}$ where $S_{lo}$
has $k$ digits in base $B$, then

$$\mathrm{Input}(S, B) = \mathrm{Input}(S_{hi}, B)\,B^k + \mathrm{Input}(S_{lo}, B),$$

where $\mathrm{Input}(S, B)$ is the value obtained when reading the string $S$ in the
external base $B$. Algorithm FastIntegerInput shows one way to implement
this: if the output $A$ has $n$ words, Algorithm FastIntegerInput has
complexity $O(M(n) \log n)$, more precisely $\sim M(n/4) \lg n$ for $n$ a power of two in
the FFT range (see Exercise 1.34).

Algorithm 1.25 FastIntegerInput
Input: a string $S = s_{m-1} \ldots s_1 s_0$ of digits in base $B$
Output: the value $A$ of the integer represented by $S$
$\ell \leftarrow [\mathrm{val}(s_0), \mathrm{val}(s_1), \ldots, \mathrm{val}(s_{m-1})]$
$(b, k) \leftarrow (B, m)$    ▷ Invariant: $\ell$ has $k$ elements $\ell_0, \ldots, \ell_{k-1}$
while $k > 1$ do
    if $k$ even then $\ell \leftarrow [\ell_0 + b\ell_1, \ell_2 + b\ell_3, \ldots, \ell_{k-2} + b\ell_{k-1}]$
    else $\ell \leftarrow [\ell_0 + b\ell_1, \ell_2 + b\ell_3, \ldots, \ell_{k-1}]$
    $(b, k) \leftarrow (b^2, \lceil k/2 \rceil)$
return $\ell_0$.
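
The pairwise-combination scheme of Algorithm FastIntegerInput can be sketched in Python as below. The sketch only shows the structure: whether it is actually subquadratic depends on the underlying multiplication, which here is whatever Python's integers provide. The name `fast_integer_input` and the digit alphabet are our own.

```python
DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"

def fast_integer_input(s, B=10):
    """Value of the non-empty digit string s in base B (Algorithm FastIntegerInput)."""
    l = [DIGITS.index(ch) for ch in reversed(s)]    # l[0] is the least significant digit
    b, k = B, len(l)
    while k > 1:
        nxt = [l[i] + b * l[i + 1] for i in range(0, k - 1, 2)]
        if k % 2 == 1:
            nxt.append(l[k - 1])          # odd length: last element is carried over
        l = nxt
        b, k = b * b, (k + 1) // 2        # now each element is a digit in base b**2
    return l[0]
```

At every level the list halves in length while the elements double in size, so the multiplications at one level have a total cost comparable to one full-size product.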

For integer output, a similar algorithm can be designed, replacing multiplications
by divisions. Namely, if $A = A_{hi}B^k + A_{lo}$, then

$$\mathrm{Output}(A, B) = \mathrm{Output}(A_{hi}, B) \,\|\, \mathrm{Output}(A_{lo}, B),$$

where $\mathrm{Output}(A, B)$ is the string resulting from writing the integer $A$ in the
external base $B$, and it is assumed that $\mathrm{Output}(A_{lo}, B)$ has exactly $k$ digits,
after possibly padding with leading zeros.

If the input $A$ has $n$ words, Algorithm FastIntegerOutput has complexity
$O(M(n) \log n)$, more precisely $\sim D(n/4) \lg n$ for $n$ a power of two
in the FFT range, where $D(n)$ is the cost of dividing a $2n$-word integer by
an $n$-word integer. Depending on the cost ratio between multiplication and
division, integer output may thus be from 2 to 5 times slower than integer
input; see however Exercise 1.35.
Algorithm 1.26 FastIntegerOutput
Input: $A = \sum_0^{n-1} a_i\beta^i$
Output: a string $S$ of characters, representing $A$ in base $B$
if $A < B$ then
    return $\mathrm{char}(A)$
else
    find $k$ such that $B^{2k-2} \le A < B^{2k}$
    $(Q, R) \leftarrow$ DivRem$(A, B^k)$
    $r \leftarrow$ FastIntegerOutput$(R)$
    return FastIntegerOutput$(Q) \,\|\, 0^{k - \mathrm{len}(r)} \,\|\, r$.
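
A recursive Python sketch of Algorithm FastIntegerOutput is given below. The search for $k$ is done by a naive loop and the powers $B^k$ are recomputed at every level; a serious implementation would precompute and reuse them. These simplifications, the function name and the digit alphabet are our own.

```python
DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"

def fast_integer_output(A, B=10):
    """String of the non-negative integer A in base B (Algorithm FastIntegerOutput)."""
    if A < B:
        return DIGITS[A]
    k = 1
    while B ** (2 * k) <= A:              # find k with B**(2k-2) <= A < B**(2k)
        k += 1
    Q, R = divmod(A, B ** k)
    r = fast_integer_output(R, B)
    # pad the low part with leading zeros so that it has exactly k digits
    return fast_integer_output(Q, B) + "0" * (k - len(r)) + r
```

The zero padding is what makes the low half occupy exactly $k$ digits, as required by the concatenation formula for Output.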



1.8 Exercises 

Exercise 1.1 Extend the Kronecker-Schönhage trick mentioned at the beginning
of §1.3 to negative coefficients, assuming the coefficients are in the range $[-p, p]$.

Exercise 1.2 (Harvey [114]) For multiplying two polynomials of degree less than
$n$, with non-negative integer coefficients bounded above by $p$, the Kronecker-Schönhage
trick performs one integer multiplication of size about $2n \lg p$, assuming
$n$ is small compared to $p$. Show that it is possible to perform two integer multiplications
of size $n \lg p$ instead, and even four integer multiplications of size $(n/2) \lg p$.

Exercise 1.3 Assume your processor provides an instruction fmaa$(a, b, c, d)$
returning $h, \ell$ such that $ab + c + d = h\beta + \ell$ where $0 \le a, b, c, d, \ell, h < \beta$. Rewrite
Algorithm BasecaseMultiply using fmaa.

Exercise 1.4 (Harvey, Khachatrian et al. [139]) For $A = \sum_{i=0}^{n-1} a_i\beta^i$ and $B =
\sum_{j=0}^{n-1} b_j\beta^j$, prove the formula:

$$AB = \sum_{i=1}^{n-1} \sum_{j=0}^{i-1} (a_i + a_j)(b_i + b_j)\beta^{i+j} + 2\sum_{i=0}^{n-1} a_i b_i \beta^{2i} - \sum_{i=0}^{n-1} \beta^i \sum_{j=0}^{n-1} a_j b_j \beta^j.$$

Deduce a new algorithm for schoolbook multiplication.

Exercise 1.5 (Hanrot) Prove that the number $K(n)$ of word products (as defined
in the proof of Thm. 1.3.2) in Karatsuba's algorithm is non-decreasing, provided
$n_0 = 2$. Plot the graph of $K(n)/n^{\lg 3}$ with a logarithmic scale for $n$, for
$2^7 \le n \le 2^{10}$, and find experimentally where the maximum appears.
Exercise 1.6 (Ryde) Assume the basecase multiply costs $M(n) = an^2 + bn$, and
that Karatsuba's algorithm costs $K(n) = 3K(n/2) + cn$. Show that dividing $a$ by
two increases the Karatsuba threshold $n_0$ by a factor of two, and on the contrary
decreasing $b$ and $c$ decreases $n_0$.

Exercise 1.7 (Maeder [158], Thomé [216]) Show that an auxiliary memory of
$2n + o(n)$ words is enough to implement Karatsuba's algorithm in-place, for an
$n$-word × $n$-word product. In the polynomial case, prove that an auxiliary space of $n$
coefficients is enough, in addition to the $n + n$ coefficients of the input polynomials,
and the $2n - 1$ coefficients of the product. [You can use the $2n$ result words, but
must not destroy the $n + n$ input words.]

Exercise 1.8 (Roche [191]) If Exercise 1.7 was too easy for you, design a
Karatsuba-like algorithm using only $O(\log n)$ extra space (you are allowed to read and
write in the $2n$ output words, but the $n + n$ input words are read-only).

Exercise 1.9 (Quercia, McLaughlin) Modify Algorithm KaratsubaMultiply
to use only $\sim 7n/2$ additions/subtractions. [Hint: decompose each of $C_0$, $C_1$ and
$C_2$ into two parts.]

Exercise 1.10 Design an in-place version of KaratsubaMultiply (see
Exercise 1.7) that accumulates the result in $c_0, \ldots, c_{n-1}$, and returns a carry bit.

Exercise 1.11 (Vuillemin) Design an algorithm to multiply $a_2x^2 + a_1x + a_0$ by
$b_1x + b_0$ using 4 multiplications. Can you extend it to a 6 × 6 product using 16
multiplications?

Exercise 1.12 (Weimerskirch, Paar) Extend the Karatsuba trick to compute
an $n \times n$ product in $n(n + 1)/2$ multiplications. For which $n$ does this win over
the classical Karatsuba algorithm?

Exercise 1.13 (Hanrot) In Algorithm OddEvenKaratsuba, if both $m$ and $n$
are odd, one combines the larger parts $A_0$ and $B_0$ together, and the smaller parts
$A_1$ and $B_1$ together. Find a way to get instead:

$$K(m, n) = K(\lceil m/2 \rceil, \lfloor n/2 \rfloor) + K(\lfloor m/2 \rfloor, \lceil n/2 \rceil) + K(\lceil m/2 \rceil, \lceil n/2 \rceil).$$

Exercise 1.14 Prove that if 5 integer evaluation points are used for Toom-Cook
3-way (§1.3.3), the division by (a multiple of) 3 can not be avoided. Does this
remain true if only 4 integer points are used together with $\infty$?
Exercise 1.15 (Quercia, Harvey) In Toom-Cook 3-way (§1.3.3), take as
evaluation point $2^w$ instead of 2, where $w$ is the number of bits per word (usually
$w = 32$ or 64). Which division is then needed? Similarly for the evaluation point
$2^{w/2}$.

Exercise 1.16 For an integer $k \ge 2$ and multiplication of two numbers of size $kn$
and $n$, show that the trivial strategy which performs $k$ multiplications, each $n \times n$,
is not the best possible in the FFT range.

Exercise 1.17 (Karatsuba, Zuras [236]) Assuming the multiplication has
superlinear cost, show that the speedup of squaring with respect to multiplication
can not significantly exceed 2.

Exercise 1.18 (Thomé, Quercia) Consider two sets $A = \{a, b, c, \ldots\}$ and $U =
\{u, v, w, \ldots\}$, and a set $X = \{x, y, z, \ldots\}$ of sums of products of elements of $A$
and $U$ (assumed to be in some field $F$). We can ask "what is the least number of
multiplies required to compute all elements of $X$?". In general, this is a difficult
problem, related to the problem of computing tensor rank, which is NP-complete
(see for example Håstad [119] and the book by Bürgisser et al. [59]). Special
cases include integer/polynomial multiplication, the middle product, and matrix
multiplication (for matrices of fixed size). As a specific example, can we compute
$x = au + cw$, $y = av + bw$, $z = bu + cv$ in fewer than 6 multiplies? Similarly for
$x = au - cw$, $y = av - bw$, $z = bu - cv$.

Exercise 1.19 In Algorithm BasecaseDivRem (§1.4.1), prove that $q_j^* \le \beta + 1$.
Can this bound be reached? In the case $q_j^* \ge \beta$, prove that the while-loop at steps
6 to 8 is executed at most once. Prove that the same holds for Svoboda's algorithm,
i.e., that $A \ge 0$ after step 8 of Algorithm SvobodaDivision (§1.4.2).

Exercise 1.20 (Granlund, Möller) In Algorithm BasecaseDivRem, estimate
the probability that $A < 0$ is true at step 6, assuming the remainder $r_j$ from
the division of $a_{n+j}\beta + a_{n+j-1}$ by $b_{n-1}$ is uniformly distributed in $[0, b_{n-1} - 1]$,
$A \bmod \beta^{n+j-1}$ is uniformly distributed in $[0, \beta^{n+j-1} - 1]$, and $B \bmod \beta^{n-1}$ is
uniformly distributed in $[0, \beta^{n-1} - 1]$. Then replace the computation of $q_j^*$ by a division
of the three most significant words of $A$ by the two most significant words of $B$.
Prove the algorithm is still correct. What is the maximal number of corrections,
and the probability that $A < 0$?

Exercise 1.21 (Montgomery [172]) Let $0 < b < \beta$, and $0 \le a_4, \ldots, a_0 < \beta$.
Prove that $a_4(\beta^4 \bmod b) + \cdots + a_1(\beta \bmod b) + a_0 < \beta^2$, provided $b < \beta/3$. Use
this fact to design an efficient algorithm dividing $A = a_{n-1}\beta^{n-1} + \cdots + a_0$ by $b$.
Does the algorithm extend to division by the least significant digits?
Exercise 1.22 In Algorithm RecursiveDivRem, find inputs that require 1, 2,
3 or 4 corrections in step 8. [Hint: consider $\beta = 2$.] Prove that when $n = m$ and
$A < \beta^m(B + 1)$, at most two corrections occur.

Exercise 1.23 Find the complexity of Algorithm RecursiveDivRem in the
FFT range.

Exercise 1.24 Consider the division of $A$ of $kn$ words by $B$ of $n$ words, with
integer $k \ge 3$, and the alternate strategy that consists of extending the divisor with
zeros so that it has half the size of the dividend. Show that this is always slower
than Algorithm UnbalancedDivision [assuming that division has superlinear
cost].

Exercise 1.25 An important special case of division is when the divisor is of the
form $b^k$. For example, this is useful for an integer output routine (§1.7). Can one
design a fast algorithm for this case?

Exercise 1.26 (Sedoglavic) Does the Kronecker-Schönhage trick to reduce
polynomial multiplication to integer multiplication (§1.3) also work, in an efficient
way, for division? Assume that you want to divide a degree-$2n$ polynomial $A(x)$
by a monic degree-$n$ polynomial $B(x)$, both polynomials having integer coefficients
bounded by $p$.

Exercise 1.27 Design an algorithm that performs an exact division of a $4n$-bit
integer by a $2n$-bit integer, with a quotient of $2n$ bits, using the idea mentioned
in the last paragraph of §1.4.5. Prove that your algorithm is correct.

Exercise 1.28 Improve the initial speed of convergence of Algorithm SqrtInt
(§1.5.1) by using a better starting approximation at step 1. Your approximation
should be in the interval $[\lfloor \sqrt{m} \rfloor, \lceil 2\sqrt{m} \rceil]$.

Exercise 1.29 (Luschny) Devise a fast algorithm for computing the binomial
coefficient

$$\binom{n}{k} = \frac{n!}{k!\,(n-k)!}$$

for integers $n, k$, $0 \le k \le n$. The algorithm should use exact integer arithmetic
and compute the exact answer.

Exercise 1.30 (Shoup) Show that in Algorithm ExtendedGcd, if $a \ge b > 0$,
and $g = \gcd(a, b)$, then the cofactor $u$ satisfies $-b/(2g) < u \le b/(2g)$.
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Exercise 1.31 (a) Devise a subquadratic GCD algorithm HalfGcd along the 
lines outlined in the first three paragraphs of ^1.6.31 (most-significant bits first). 
The input is two integers a >b > 0. The output is a 2 x 2 matrix R and integers 
a' , b' such that [a' 6']* = R[a 6]*. If the inputs have size n bits, then the elements 
of R should have at most ri/2 + 0(1) bits, and the outputs a', b' should have at 
most 3re/4-|-0(l) bits, (b) Construct a plain GCD algorithm which calls HalfGcd 
until the arguments are small enough to call a naive algorithm, (c) Compare this 
approach with the use of HalfBinaryGcd in ^1.6.3[ 

Exercise 1.32 (Galbraith, Schonhage, Stehle) The Jacobi symbol (a|6) of 
an integer a and a positive odd integer b satisfies {a\b) = (a mod b\b), the law of 
quadratic reciprocity {a\b)(b\a) = (— l)("~i)(''"i)/^ for a odd and positive, 
together with (-1|6) = (-l)(^-i)/2, and (2|6) = (-l)(^'-i)/8. This looks very 
much like the gcd recurrence: gcd(a, b) = gcd(o mod 6, b) and gcd(a, b) = gcd(6, a). 
Can you design an 0(M(n) logn) algorithm to compute the Jacobi symbol of two 
n-bit integers? 

Exercise 1.33 Show that $B$ and $\beta$ are commensurable, in the sense defined in
§1.7, iff $\ln(B)/\ln(\beta) \in \mathbb{Q}$.

Exercise 1.34 Find a formula $T(n)$ for the asymptotic complexity of Algorithm
FastIntegerInput when $n = 2^k$ (§1.7.2). Show that, for general n, your formula
is within a factor of two of $T(n)$. [Hint: consider the binary expansion of n.]

Exercise 1.35 Show that the integer output routine can be made as fast
(asymptotically) as the integer input routine FastIntegerInput. Do timing experiments
with your favorite multiple-precision software. [Hint: use D. Bernstein's scaled
remainder tree [21] and the middle product.]

Exercise 1.36 If the internal base $\beta$ and the external base $B$ share a nontrivial
common divisor (as in the case $\beta = 2^{\ell}$ and $B = 10$), show how one can exploit
this to speed up the subquadratic input and output routines.

Exercise 1.37 Assume you are given two n-digit integers in base ten, but you
have fast arithmetic only in base two. Can you multiply the integers in time
$O(M(n))$?

1.9 Notes and References

"On-line" (as opposed to "off-line") algorithms are considered in many books and
papers; see for example the book by Borodin and El-Yaniv [33]. "Relaxed" algorithms
were introduced by van der Hoeven. For references and a discussion of the
differences between "lazy", "zealous" and "relaxed" algorithms, see [124].

An example of an implementation with "guard bits" to avoid overflow problems
in integer addition (§1.2) is the block-wise modular arithmetic of Lenstra and Dixon
on the MasPar [87]. They used $\beta = 2^{30}$ with 32-bit words.

The observation that polynomial multiplication reduces to integer multiplication
is due to both Kronecker and Schönhage, which explains the name
"Kronecker-Schönhage trick". More precisely, Kronecker [147, pp. 941-942] (also [148, §4])
reduced the irreducibility test for factorization of multivariate polynomials to the
univariate case, and Schönhage [197] reduced the univariate case to the integer case.
The Kronecker-Schönhage trick is improved in Harvey [114] (see Exercise 1.2), and
some nice applications of it are given in Steel [207].

Karatsuba's algorithm was first published in [136]. Very little is known about
its average complexity. What is clear is that no simple asymptotic equivalent can
be obtained, since the ratio $K(n)/n^{\alpha}$ does not converge (see Exercise 1.5).

Andrei Toom [218] discovered the class of Toom-Cook algorithms, and they were
discussed by Stephen Cook in his thesis [76, pp. 51-77]. A very good description of
these algorithms can be found in the book by Crandall and Pomerance |8H §9.5.1]. 
In particular it describes how to generate the evaluation and interpolation formulae 
symbolically. Zuras [236] considers the 4-way and 5-way variants, together with
squaring. Bodrato and Zanoni [31] show that the Toom-Cook 3-way interpolation
scheme of §1.3.3 is close to optimal for the points $0, 1, -1, 2, \infty$; they also exhibit
efficient 4-way and 5-way schemes. Bodrato and Zanoni also introduced the Toom-2.5
and Toom-3.5 notations for what we call Toom-(3, 2) and Toom-(4, 3), these
algorithms being useful for unbalanced multiplication using a different number
of pieces. They noticed that Toom-(4, 2) only differs from Toom 3-way in the
evaluation phase, thus most of the implementation can be shared.

The Schönhage-Strassen algorithm first appeared in [200], and is described in
§2.3.3. Algorithms using floating-point complex numbers are discussed in Knuth's
classic [143, §4.3.3.C]. See also §3.3.1.

The odd-even scheme is described in Hanrot and Zimmermann [112], and was
independently discovered by Andreas Enge. The asymmetric squaring formula
given in §1.3.6 was invented by Chung and Hasan (see their paper [66] for other
asymmetric formulae). Exercise 1.4 was suggested by David Harvey, who independently
discovered the algorithm of Khachatrian et al. [139].

See Lefèvre [153] for a comparison of different algorithms for the problem of
multiplication by an integer constant.

Svoboda's algorithm was introduced in [212]. The exact division algorithm
starting from least significant bits is due to Jebelean [130]. Jebelean and Krandick
invented the "bidirectional" algorithm [145]. The Karp-Markstein trick to speed
up Newton's iteration (or Hensel lifting over p-adic numbers) is described in [138].
The "recursive division" of §1.4.3 is from Burnikel and Ziegler [60], although earlier
but not-so-detailed ideas can be found in Jebelean [132], and even earlier in Moenck
and Borodin [167]. The definition of Hensel's division used here is due to Shand
and Vuillemin [202], who also point out the duality with Euclidean division.

Algorithm SqrtRem (§1.5.1) was first described in Zimmermann [235], and
proved correct in Bertot et al. [29]. Algorithm SqrtInt is described in [73]; its
generalization to k-th roots (Algorithm RootInt) is due to Keith Briggs. The
detection of exact powers is discussed in Bernstein, Lenstra and Pila [23] and
earlier in Bernstein [17] and Cohen [73]. It is necessary, for example, in the AKS
primality test [2].

The classical (quadratic) Euclidean algorithm has been considered by many
authors; a good reference is Knuth [143]. The Gauss-Kuz'min theorem¹ gives
the distribution of quotients in the regular continued fraction of almost all real
numbers, and hence is a good guide to the distribution of quotients in the Euclidean
algorithm for large, random inputs. Lehmer's original algorithm is described in
[155]. The binary gcd is almost as old as the classical Euclidean algorithm:
Knuth [143] has traced it back to a first-century AD Chinese text Chiu Chang
Suan Shu (see also Mikami [166]). It was rediscovered several times in the 20th
century, and it is usually attributed to Stein [210]. The binary gcd has been
analysed by Brent [MHSU], Knuth [143], Maze [160] and Vallée [222]. A parallel
(systolic) version that runs in $O(n)$ time using $O(n)$ processors was given by Brent
and Kung [53].

The double-digit gcd is due to Jebelean [131]. The k-ary gcd reduction is due
to Sorenson [206], and was improved and implemented in GNU MP by Weber.
Weber also invented Algorithm ReducedRatMod [227], inspired by previous
work of Wang.

The first subquadratic gcd algorithm was published by Knuth [142], but his
complexity analysis was suboptimal: he gave $O(n \log^5 n \log\log n)$. The correct
complexity $O(n \log^2 n \log\log n)$ was given by Schönhage [196]; for this reason the
algorithm is sometimes called the Knuth-Schönhage algorithm. A description for
the polynomial case can be found in Aho, Hopcroft and Ullman [3], and a detailed
(but incorrect) description for the integer case in Yap [233]. The subquadratic
binary gcd given in §1.6.3 is due to Stehlé and Zimmermann [209]. Möller [169]
compares various subquadratic algorithms, and gives a nice algorithm without
"repair steps".

¹ According to the Gauss-Kuz'min theorem [140], the probability of a quotient
$q \in \mathbb{N}^*$ is $\lg(1 + 1/q) - \lg(1 + 1/(q+1))$.

Several authors mention an $O(n \log^2 n \log\log n)$ algorithm for the computation
of the Jacobi symbol [89, 201]. The earliest reference that we know is a paper by
Bach [8], which gives the basic idea (due to Gauss [101, p. 509]). Details are given
in the book by Bach and Shallit [9, Solution of Exercise 5.52], where the algorithm
is said to be "folklore", with the ideas going back to Bachmann [10] and Gauss.
The existence of such an algorithm is mentioned in Schönhage's book [199, §7.2.3],
but without details. See also Exercise 1.32.



Chapter 2 

Modular Arithmetic and the 
FFT 

In this chapter our main topic is modular arithmetic, i.e., how
to compute efficiently modulo a given integer $N$. In most applications,
the modulus is fixed, and special-purpose algorithms
benefit from some precomputations, depending only on $N$, to
speed up arithmetic modulo $N$.

There is an overlap between Chapter 1 and this chapter. For
example, integer division and modular multiplication are closely
related. In Chapter 1 we present algorithms where no (or only
a few) precomputations with respect to the modulus $N$ are performed.
In this chapter we consider algorithms which benefit
from such precomputations.

Unless explicitly stated, we consider that the modulus $N$ occupies
n words in the word-base $\beta$, i.e., $\beta^{n-1} \le N < \beta^n$.

2.1 Representation 

We consider in this section the different possible representations of residues
modulo $N$. As in Chapter 1, we consider mainly dense representations.

2.1.1 Classical Representation 

The classical representation stores a residue (class) a as an integer $0 \le a < N$.
Residues are thus always fully reduced, i.e., in canonical form.

Another non-redundant form consists in choosing a symmetric representation,
say $-N/2 \le a < N/2$. This form might save some reductions in
additions or subtractions (see §2.2). Negative numbers might be stored either
with a separate sign (sign-magnitude representation) or with a
two's-complement representation.

Since $N$ takes n words in base $\beta$, an alternative redundant representation
chooses $0 \le a < \beta^n$ to represent a residue class. If the underlying arithmetic
is word-based, this will yield no slowdown compared to the canonical form.
An advantage of this representation is that, when adding two residues, it
suffices to compare their sum to $\beta^n$ in order to decide whether the sum has
to be reduced, and the result of this comparison is simply given by the carry
bit of the addition (see Algorithm IntegerAddition in §1.2), instead of by
comparing the sum with $N$. However, in the case that the sum has to be
reduced, one or more further comparisons are needed.

2.1.2 Montgomery's Form 

Montgomery's form is another representation widely used when several modular
operations have to be performed modulo the same integer $N$ (additions,
subtractions, modular multiplications). It implies a small overhead to convert
(if needed) from the classical representation to Montgomery's and
vice-versa, but this overhead is often more than compensated by the speedup
obtained in the modular multiplication.

The main idea is to represent a residue a by $a' = aR \bmod N$, where $R = \beta^n$,
and $N$ takes n words in base $\beta$. Thus Montgomery is not concerned with
the physical representation of a residue class, but with the meaning associated
to a given physical representation. (As a consequence, the different choices
mentioned above for the physical representation are all possible.) Addition
and subtraction are unchanged, but (modular) multiplication translates to a
different, much simpler, algorithm (§2.4.2).

In most applications using Montgomery's form, all inputs are first converted
to Montgomery's form, using $a' = aR \bmod N$, then all computations
are performed in Montgomery's form, and finally all outputs are converted
back (if needed) to the classical form, using $a = a'/R \bmod N$. We need
to assume that $(R, N) = 1$, or equivalently that $(\beta, N) = 1$, to ensure the
existence of $R^{-1} \bmod N$. This is not usually a problem because $\beta$ is a power
of two and $N$ can be assumed to be odd.

    classical (MSB)        p-adic (LSB)
    ---------------------  -------------------------------------
    Euclidean division     Hensel division, Montgomery reduction
    Svoboda's algorithm    Montgomery-Svoboda
    Euclidean gcd          binary gcd
    Newton's method        Hensel lifting

Figure 2.1: Equivalence between LSB and MSB algorithms.
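
As an illustration of the conversions to and from Montgomery's form described in §2.1.2, here is a minimal Python sketch. The word base $\beta = 2^{64}$, the modulus, and the helper names `to_montgomery`, `from_montgomery` and `mont_add` are assumptions made for the example only; the dedicated multiplication algorithm is the subject of §2.4.2 and is not shown.

```python
# Sketch of Montgomery's form: a' = a*R mod N with R = beta**n.
# beta = 2**64, N and all function names are illustrative assumptions.

def to_montgomery(a, N, R):
    """Classical residue a -> Montgomery form a*R mod N."""
    return a * R % N

def from_montgomery(a_m, N, R):
    """Montgomery form -> classical residue, i.e. a_m / R mod N.
    Requires gcd(R, N) = 1; pow(R, -1, N) needs Python 3.8+."""
    return a_m * pow(R, -1, N) % N

def mont_add(x_m, y_m, N):
    """Addition is unchanged in Montgomery form: aR + bR = (a + b)R mod N."""
    s = x_m + y_m
    return s - N if s >= N else s

beta, n = 2**64, 2
R = beta**n
N = 2**127 - 1                      # odd, with beta**(n-1) <= N < beta**n
a, b = 1234567890123456789, 98765432109876543210
s = from_montgomery(
    mont_add(to_montgomery(a, N, R), to_montgomery(b, N, R), N), N, R)
```

Here `s` is the classical residue of a + b, obtained entirely through Montgomery-form operands.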



2.1.3 Residue Number Systems 

In a Residue Number System, a residue a is represented by a list of residues
$a_i$ modulo $N_i$, where the moduli $N_i$ are coprime and their product is $N$. The
integers $a_i$ can be efficiently computed from a using a remainder tree, and
the unique integer $0 \le a < N = N_1 N_2 \cdots$ is computed from the $a_i$ by an
Explicit Chinese Remainder Theorem (§2.7). The residue number system
is interesting since addition and multiplication can be performed in parallel
on each small residue $a_i$. This representation requires that $N$ factors into
convenient moduli $N_1, N_2, \ldots$, which is not always the case (see however §2.9).
Conversion to/from the RNS representation costs $O(M(n) \log n)$, see §2.7.
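
The following Python sketch illustrates the residue number system on a toy example. It uses a plain (quadratic) Chinese remaindering formula rather than the remainder tree or the explicit CRT of §2.7; the moduli and the function names `to_rns` and `from_rns` are assumptions chosen for the illustration.

```python
from math import prod

def to_rns(a, moduli):
    """List of residues a_i = a mod N_i (naive; a remainder tree is faster)."""
    return [a % m for m in moduli]

def from_rns(residues, moduli):
    """Recover the unique 0 <= a < N = prod(moduli) by naive Chinese remaindering."""
    N = prod(moduli)
    a = 0
    for r, m in zip(residues, moduli):
        Nm = N // m                       # product of the other moduli
        a += r * Nm * pow(Nm, -1, m)      # pow(.., -1, m) needs Python 3.8+
    return a % N

moduli = [7, 11, 13, 15]                  # pairwise coprime, N = 15015
x, y = 1234, 5678
xs, ys = to_rns(x, moduli), to_rns(y, moduli)

# Addition and multiplication act independently on each small residue.
sum_rns = [(u + v) % m for u, v, m in zip(xs, ys, moduli)]
prod_rns = [(u * v) % m for u, v, m in zip(xs, ys, moduli)]
```

Each component of `sum_rns` and `prod_rns` could be computed by a separate processor, which is the parallelism mentioned above.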

2.1.4 MSB vs LSB Algorithms 

Many classical (most significant bits first or MSB) algorithms have a p-adic
(least significant bits first or LSB) equivalent form. Thus several algorithms
in this chapter are just LSB-variants of algorithms discussed in Chapter 1
(see Figure 2.1).

2.1.5 Link with Polynomials 

As in Chapter 1, a strong link exists between modular arithmetic and arithmetic
on polynomials. One way of implementing finite fields $\mathbb{F}_q$ with $q = p^n$
elements is to work with polynomials in $\mathbb{F}_p[x]$, which are reduced modulo a
monic irreducible polynomial $f(x) \in \mathbb{F}_p[x]$ of degree n. In this case modular
reduction happens both at the coefficient level (in $\mathbb{F}_p$) and at the polynomial
level (modulo $f(x)$).

Some algorithms work in the ring $(\mathbb{Z}/N\mathbb{Z})[x]$, where $N$ is a composite integer.
An important case is the Schönhage-Strassen multiplication algorithm,
where $N$ has the form $2^{\ell} + 1$.

In both domains $\mathbb{F}_p[x]$ and $(\mathbb{Z}/N\mathbb{Z})[x]$, the Kronecker-Schönhage trick
(§1.3) can be applied efficiently. Since the coefficients are known to be
bounded, by $p$ and $N$ respectively, and thus have a fixed size, the segmentation
is quite efficient. If polynomials have degree $d$ and coefficients are
bounded by $N$, the product coefficients are bounded by $dN^2$, and one obtains
$O(M(d \log(Nd)))$ operations, instead of $O(M(d) M(\log N))$ with the classical
approach. Also, the implementation is simpler, because we only have to
implement fast arithmetic for large integers instead of fast arithmetic at both
the polynomial level and the coefficient level (see also Exercises 1.2 and 2.4).
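
The segmentation argument above can be sketched in Python as follows. This is an illustrative version of the Kronecker-Schönhage trick for polynomials over $\mathbb{Z}/N\mathbb{Z}$, relying on Python's built-in big-integer product; the function name and the choice of slot width are assumptions for the example.

```python
def poly_mul_mod_kronecker(f, g, N):
    """Multiply polynomials f, g (coefficient lists, low degree first,
    coefficients in [0, N)) modulo N via one big-integer product."""
    d = max(len(f), len(g))
    # Every product coefficient is a sum of at most d terms, each < N^2,
    # hence < d*N^2 < 2^bits: the slots of the packed product cannot overlap.
    bits = (d * N * N).bit_length()
    F = sum(c << (i * bits) for i, c in enumerate(f))   # evaluate f at 2^bits
    G = sum(c << (i * bits) for i, c in enumerate(g))   # evaluate g at 2^bits
    H = F * G                                           # one integer product
    mask = (1 << bits) - 1
    return [((H >> (i * bits)) & mask) % N
            for i in range(len(f) + len(g) - 1)]

# (1 + 2x + 3x^2)(4 + 5x) modulo 7
h = poly_mul_mod_kronecker([1, 2, 3], [4, 5], 7)
```

Only the integer product `F * G` needs a fast algorithm; packing and unpacking are linear in the size of the operands.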

2.2 Modular Addition and Subtraction 

The addition of two residues in classical representation can be done as in
Algorithm ModularAdd.

Algorithm 2.1 ModularAdd
Input: residues a, b with $0 \le a, b < N$
Output: $c = a + b \bmod N$
    $c \leftarrow a + b$
    if $c \ge N$ then
        $c \leftarrow c - N$.

Assuming that a and b are uniformly distributed in $\mathbb{Z} \cap [0, N-1]$, the
subtraction $c \leftarrow c - N$ is performed with probability $(1 - 1/N)/2$. If we
use instead a symmetric representation in $[-N/2, N/2)$, the probability that
we need to add or subtract $N$ drops to $1/4 + O(1/N^2)$ at the cost of an
additional test. This extra test might be expensive for small $N$ (say one
or two words) but should be relatively cheap if $N$ is large enough, say at
least ten words.
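
The two probabilities quoted above can be checked by brute force for small moduli. The sketch below is an illustration only: `modular_add` follows Algorithm ModularAdd, while the two counting functions and their names are assumptions introduced for this check.

```python
from fractions import Fraction

def modular_add(a, b, N):
    """Algorithm ModularAdd for residues 0 <= a, b < N."""
    c = a + b
    if c >= N:
        c -= N
    return c

def reduction_probability_classical(N):
    """Exact probability that c >= N for uniform a, b in [0, N-1]."""
    hits = sum(1 for a in range(N) for b in range(N) if a + b >= N)
    return Fraction(hits, N * N)

def reduction_probability_symmetric(N):
    """Exact probability that a + b leaves [-N/2, N/2) for uniform
    symmetric representatives a, b."""
    lo = -(N // 2)                       # smallest representative
    reps = range(lo, lo + N)
    hits = sum(1 for a in reps for b in reps if not (lo <= a + b < lo + N))
    return Fraction(hits, N * N)
```

Exhaustive counting is only practical for toy moduli, but it makes the gap between roughly 1/2 and roughly 1/4 easy to observe.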

2.3 The Fourier Transform 

In this section we introduce the discrete Fourier transform (DFT). An important
application of the DFT is in computing convolutions via the Convolution
Theorem. In general, the convolution of two vectors can be computed
using three DFTs (for details see §2.9). Here we show how to compute the
DFT efficiently (via the Fast Fourier Transform or FFT), and show how it
can be used to multiply two n-bit integers in time $O(n \log n \log\log n)$ (the
Schönhage-Strassen algorithm, see §2.3.3).



2.3.1 Theoretical Setting 

Let $R$ be a ring, $K \ge 2$ an integer, and $\omega$ a K-th principal root of unity in
$R$, i.e., such that $\omega^K = 1$ and $\sum_{j=0}^{K-1} \omega^{ij} = 0$ for $1 \le i < K$. The Fourier
transform (or forward (Fourier) transform) of a vector $a = [a_0, a_1, \ldots, a_{K-1}]$
of $K$ elements from $R$ is the vector $\hat{a} = [\hat{a}_0, \hat{a}_1, \ldots, \hat{a}_{K-1}]$ such that

$$\hat{a}_i = \sum_{j=0}^{K-1} \omega^{ij} a_j. \qquad (2.1)$$

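If we transform the vector a twice, we get back to the initial vector,
apart from a multiplicative factor $K$ and a permutation of the elements of
the vector. Indeed, for $0 \le i < K$,

$$\hat{\hat{a}}_i = \sum_{j=0}^{K-1} \omega^{ij} \hat{a}_j
  = \sum_{j=0}^{K-1} \omega^{ij} \sum_{\ell=0}^{K-1} \omega^{j\ell} a_\ell
  = \sum_{\ell=0}^{K-1} a_\ell \left( \sum_{j=0}^{K-1} \omega^{(i+\ell)j} \right).$$

Let $\tau = \omega^{i+\ell}$. If $i + \ell \ne 0 \bmod K$, i.e., if $i + \ell$ is not 0 or $K$, the sum
$\sum_{j=0}^{K-1} \tau^j$ vanishes since $\omega$ is principal. For $i + \ell \in \{0, K\}$ we have
$\tau = 1$ and the sum equals $K$. It follows that

$$\hat{\hat{a}}_i = K \sum_{\substack{\ell = 0 \\ i + \ell \in \{0, K\}}}^{K-1} a_\ell = K a_{(-i) \bmod K}.$$

Thus we have $\hat{\hat{a}} = K[a_0, a_{K-1}, a_{K-2}, \ldots, a_2, a_1]$.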

If we transform the vector a twice, but use $\omega^{-1}$ instead of $\omega$ for the second
transform (which is then called a backward transform), we get:

$$\tilde{\hat{a}}_i = \sum_{j=0}^{K-1} \omega^{-ij} \hat{a}_j
  = \sum_{j=0}^{K-1} \omega^{-ij} \sum_{\ell=0}^{K-1} \omega^{j\ell} a_\ell
  = \sum_{\ell=0}^{K-1} a_\ell \left( \sum_{j=0}^{K-1} \omega^{(\ell-i)j} \right).$$

The sum $\sum_{j=0}^{K-1} \omega^{(\ell-i)j}$ vanishes unless $\ell = i$, in which case it equals $K$. Thus
we have $\tilde{\hat{a}}_i = K a_i$. Apart from the multiplicative factor $K$, the backward
transform is the inverse of the forward transform, as might be expected from
the names.
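
The two identities of this section can be checked numerically with a direct evaluation of Eqn. (2.1). The Python sketch below works in the ring $\mathbb{Z}/17\mathbb{Z}$ with $K = 8$ and $\omega = 2$; these parameters and the function name `dft` are assumptions chosen for the illustration (2 has multiplicative order 8 modulo 17, so it is a principal 8-th root of unity there).

```python
def dft(a, w, p):
    """Naive evaluation of Eqn. (2.1) in Z/pZ: K^2 ring operations."""
    K = len(a)
    return [sum(pow(w, i * j, p) * a[j] for j in range(K)) % p
            for i in range(K)]

p, K, w = 17, 8, 2
a = [3, 1, 4, 1, 5, 9, 2, 6]

a_hat = dft(a, w, p)
twice = dft(a_hat, w, p)                # forward transform applied twice
back = dft(a_hat, pow(w, -1, p), p)     # backward transform (Python 3.8+)
```

With these definitions, `twice` should be $K$ times the vector a with its indices negated modulo $K$, and `back` should be $K$ times a, all modulo 17.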

2.3.2 The Fast Fourier Transform 

If evaluated naively, Eqn. (2.1) requires $\Omega(K^2)$ operations to compute the
Fourier transform of a vector of $K$ elements. The Fast Fourier Transform
or FFT is an efficient way to evaluate Eqn. (2.1), using only $O(K \log K)$
operations. From now on we assume that $K$ is a power of two, since this is
the most common case and simplifies the description of the FFT (see §2.9
for the general case).

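Let us illustrate the FFT for $K = 8$. Since $\omega^8 = 1$, we have reduced the
exponents modulo 8 in the following. We want to compute:

$$\begin{aligned}
\hat{a}_0 &= a_0 + a_1 + a_2 + a_3 + a_4 + a_5 + a_6 + a_7, \\
\hat{a}_1 &= a_0 + \omega a_1 + \omega^2 a_2 + \omega^3 a_3 + \omega^4 a_4 + \omega^5 a_5 + \omega^6 a_6 + \omega^7 a_7, \\
\hat{a}_2 &= a_0 + \omega^2 a_1 + \omega^4 a_2 + \omega^6 a_3 + a_4 + \omega^2 a_5 + \omega^4 a_6 + \omega^6 a_7, \\
\hat{a}_3 &= a_0 + \omega^3 a_1 + \omega^6 a_2 + \omega a_3 + \omega^4 a_4 + \omega^7 a_5 + \omega^2 a_6 + \omega^5 a_7, \\
\hat{a}_4 &= a_0 + \omega^4 a_1 + a_2 + \omega^4 a_3 + a_4 + \omega^4 a_5 + a_6 + \omega^4 a_7, \\
\hat{a}_5 &= a_0 + \omega^5 a_1 + \omega^2 a_2 + \omega^7 a_3 + \omega^4 a_4 + \omega a_5 + \omega^6 a_6 + \omega^3 a_7, \\
\hat{a}_6 &= a_0 + \omega^6 a_1 + \omega^4 a_2 + \omega^2 a_3 + a_4 + \omega^6 a_5 + \omega^4 a_6 + \omega^2 a_7, \\
\hat{a}_7 &= a_0 + \omega^7 a_1 + \omega^6 a_2 + \omega^5 a_3 + \omega^4 a_4 + \omega^3 a_5 + \omega^2 a_6 + \omega a_7.
\end{aligned}$$

We see that we can share some computations. For example, the sum $a_0 + a_4$
appears in four places: in $\hat{a}_0$, $\hat{a}_2$, $\hat{a}_4$ and $\hat{a}_6$. Let us define $a_{0,4} = a_0 + a_4$,
$a_{1,5} = a_1 + a_5$, $a_{2,6} = a_2 + a_6$, $a_{3,7} = a_3 + a_7$, $a_{4,0} = a_0 + \omega^4 a_4$, $a_{5,1} = a_1 + \omega^4 a_5$,
$a_{6,2} = a_2 + \omega^4 a_6$, $a_{7,3} = a_3 + \omega^4 a_7$. Then we have, using the fact that $\omega^8 = 1$:

$$\begin{aligned}
\hat{a}_0 &= a_{0,4} + a_{1,5} + a_{2,6} + a_{3,7}, &
\hat{a}_1 &= a_{4,0} + \omega a_{5,1} + \omega^2 a_{6,2} + \omega^3 a_{7,3}, \\
\hat{a}_2 &= a_{0,4} + \omega^2 a_{1,5} + \omega^4 a_{2,6} + \omega^6 a_{3,7}, &
\hat{a}_3 &= a_{4,0} + \omega^3 a_{5,1} + \omega^6 a_{6,2} + \omega a_{7,3}, \\
\hat{a}_4 &= a_{0,4} + \omega^4 a_{1,5} + a_{2,6} + \omega^4 a_{3,7}, &
\hat{a}_5 &= a_{4,0} + \omega^5 a_{5,1} + \omega^2 a_{6,2} + \omega^7 a_{7,3}, \\
\hat{a}_6 &= a_{0,4} + \omega^6 a_{1,5} + \omega^4 a_{2,6} + \omega^2 a_{3,7}, &
\hat{a}_7 &= a_{4,0} + \omega^7 a_{5,1} + \omega^6 a_{6,2} + \omega^5 a_{7,3}.
\end{aligned}$$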



Now the sum $a_{0,4} + a_{2,6}$ appears at two different places. Let $a_{0,4,2,6} = a_{0,4} + a_{2,6}$,
$a_{1,5,3,7} = a_{1,5} + a_{3,7}$, $a_{2,6,0,4} = a_{0,4} + \omega^4 a_{2,6}$, $a_{3,7,1,5} = a_{1,5} + \omega^4 a_{3,7}$,
$a_{4,0,6,2} = a_{4,0} + \omega^2 a_{6,2}$, $a_{5,1,7,3} = a_{5,1} + \omega^2 a_{7,3}$, $a_{6,2,4,0} = a_{4,0} + \omega^6 a_{6,2}$,
$a_{7,3,5,1} = a_{5,1} + \omega^6 a_{7,3}$. Then we have

$$\begin{aligned}
\hat{a}_0 &= a_{0,4,2,6} + a_{1,5,3,7}, &
\hat{a}_1 &= a_{4,0,6,2} + \omega a_{5,1,7,3}, \\
\hat{a}_2 &= a_{2,6,0,4} + \omega^2 a_{3,7,1,5}, &
\hat{a}_3 &= a_{6,2,4,0} + \omega^3 a_{7,3,5,1}, \\
\hat{a}_4 &= a_{0,4,2,6} + \omega^4 a_{1,5,3,7}, &
\hat{a}_5 &= a_{4,0,6,2} + \omega^5 a_{5,1,7,3}, \\
\hat{a}_6 &= a_{2,6,0,4} + \omega^6 a_{3,7,1,5}, &
\hat{a}_7 &= a_{6,2,4,0} + \omega^7 a_{7,3,5,1}.
\end{aligned}$$

In summary, after a first stage where we have computed 8 intermediary variables
$a_{0,4}$ to $a_{7,3}$, and a second stage with 8 extra intermediary variables
$a_{0,4,2,6}$ to $a_{7,3,5,1}$, we are able to compute the transformed vector in 8 extra
steps. The total number of steps is thus 24 = 8 lg 8, where each step has the
form $a \leftarrow b + \omega^j c$.

If we take a closer look, we can group operations in pairs $(a, a')$ which
have the form $a = b + \omega^j c$ and $a' = b + \omega^{j+4} c$. For example, in the first stage
we have $a_{1,5} = a_1 + a_5$ and $a_{5,1} = a_1 + \omega^4 a_5$; in the second stage we have
$a_{4,0,6,2} = a_{4,0} + \omega^2 a_{6,2}$ and $a_{6,2,4,0} = a_{4,0} + \omega^6 a_{6,2}$. Since $\omega^4 = -1$, this can
also be written $(a, a') = (b + \omega^j c, b - \omega^j c)$, where $\omega^j c$ needs to be computed
only once. A pair of two such operations is called a butterfly operation.

The FFT can be performed in place. Indeed, the result of the butterfly
between $a_0$ and $a_4$, that is $(a_{0,4}, a_{4,0}) = (a_0 + a_4, a_0 - a_4)$, can overwrite
$(a_0, a_4)$, since the values of $a_0$ and $a_4$ are no longer needed.

Algorithm ForwardFFT is a recursive and in-place implementation of
the forward FFT. It uses an auxiliary function bitrev(j, K) which returns
the bit-reversal of the integer j, considered as an integer of lg K bits. For
example, bitrev(j, 8) gives 0, 4, 2, 6, 1, 5, 3, 7 for j = 0, ..., 7.

Algorithm 2.2 ForwardFFT
Input: vector $a = [a_0, a_1, \ldots, a_{K-1}]$, $\omega$ principal K-th root of unity, $K = 2^k$
Output: in-place transformed vector $\hat{a}$, bit-reversed
1: if $K = 2$ then
2:     $[a_0, a_1] \leftarrow [a_0 + a_1, a_0 - a_1]$
3: else
4:     $[a_0, a_2, \ldots, a_{K-2}] \leftarrow$ ForwardFFT($[a_0, a_2, \ldots, a_{K-2}]$, $\omega^2$, $K/2$)
5:     $[a_1, a_3, \ldots, a_{K-1}] \leftarrow$ ForwardFFT($[a_1, a_3, \ldots, a_{K-1}]$, $\omega^2$, $K/2$)
6:     for j from 0 to $K/2 - 1$ do
7:         $[a_{2j}, a_{2j+1}] \leftarrow [a_{2j} + \omega^{\mathrm{bitrev}(j, K/2)} a_{2j+1}, \; a_{2j} - \omega^{\mathrm{bitrev}(j, K/2)} a_{2j+1}]$.

Theorem 2.3.1 Given an input vector $a = [a_0, a_1, \ldots, a_{K-1}]$, Algorithm
ForwardFFT replaces it by its Fourier transform, in bit-reverse order, in
$O(K \log K)$ operations in the ring $R$.

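Proof. We prove the statement by induction on $K = 2^k$. For $K = 2$,
the Fourier transform of $[a_0, a_1]$ is $[a_0 + a_1, a_0 + \omega a_1]$, and the bit-reverse
order coincides with the normal order; since $\omega = -1$, the statement follows.
Now assume the statement is true for $K/2$. Let $0 \le j < K/2$, and write
$j' := \mathrm{bitrev}(j, K/2)$. Let $b = [b_0, \ldots, b_{K/2-1}]$ be the vector obtained at step 4,
and $c = [c_0, \ldots, c_{K/2-1}]$ be the vector obtained at step 5. By induction:

$$b_j = \sum_{\ell=0}^{K/2-1} \omega^{2j'\ell} a_{2\ell}, \qquad
  c_j = \sum_{\ell=0}^{K/2-1} \omega^{2j'\ell} a_{2\ell+1}.$$

Since $b_j$ is stored at $a_{2j}$ and $c_j$ at $a_{2j+1}$, we compute at step 7:

$$a_{2j} = b_j + \omega^{j'} c_j
  = \sum_{\ell=0}^{K/2-1} \omega^{2j'\ell} a_{2\ell} + \omega^{j'} \sum_{\ell=0}^{K/2-1} \omega^{2j'\ell} a_{2\ell+1}
  = \sum_{\ell=0}^{K-1} \omega^{j'\ell} a_\ell = \hat{a}_{j'}.$$

Similarly, since $-\omega^{j'} = \omega^{K/2+j'}$:

$$a_{2j+1} = \sum_{\ell=0}^{K/2-1} \omega^{2j'\ell} a_{2\ell} + \omega^{K/2+j'} \sum_{\ell=0}^{K/2-1} \omega^{2j'\ell} a_{2\ell+1}
  = \sum_{\ell=0}^{K-1} \omega^{(K/2+j')\ell} a_\ell = \hat{a}_{K/2+j'},$$

where we used the fact that $\omega^{2j'} = \omega^{2(j'+K/2)}$. Since bitrev(2j, K) =
bitrev(j, K/2) and bitrev(2j + 1, K) = K/2 + bitrev(j, K/2), the first part
of the theorem follows. The complexity bound follows from the fact that the
cost $T(K)$ satisfies the recurrence $T(K) \le 2T(K/2) + O(K)$. □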
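
As an illustration of Algorithm ForwardFFT, here is a Python sketch over the ring $\mathbb{Z}/p\mathbb{Z}$. It follows the steps of the algorithm but, for simplicity, returns a new list instead of working strictly in place; the modulus $p = 17$, the root $\omega = 2$ and the helper `naive_dft` used for comparison are assumptions made for the example.

```python
def bitrev(j, K):
    """Reverse the lg(K) low-order bits of j, for K a power of two."""
    r = 0
    for _ in range(K.bit_length() - 1):
        r = (r << 1) | (j & 1)
        j >>= 1
    return r

def forward_fft(a, w, p):
    """Forward transform of a (length K = 2^k >= 2) modulo p,
    returned in bit-reversed order; w is a principal K-th root of unity."""
    K = len(a)
    if K == 2:
        return [(a[0] + a[1]) % p, (a[0] - a[1]) % p]
    w2 = w * w % p
    out = [0] * K
    out[0::2] = forward_fft(a[0::2], w2, p)      # step 4: even-indexed part
    out[1::2] = forward_fft(a[1::2], w2, p)      # step 5: odd-indexed part
    for j in range(K // 2):                      # steps 6-7: butterflies
        t = pow(w, bitrev(j, K // 2), p) * out[2 * j + 1] % p
        out[2 * j], out[2 * j + 1] = (out[2 * j] + t) % p, (out[2 * j] - t) % p
    return out

def naive_dft(a, w, p):
    """Direct evaluation of Eqn. (2.1), used only as a reference."""
    K = len(a)
    return [sum(pow(w, i * j, p) * a[j] for j in range(K)) % p
            for i in range(K)]

p, w = 17, 2                                     # 2 has order 8 modulo 17
a = [3, 1, 4, 1, 5, 9, 2, 6]
a_hat_bitrev = forward_fft(a, w, p)
```

Entry j of `a_hat_bitrev` holds $\hat{a}_{\mathrm{bitrev}(j, K)}$, so a comparison with `naive_dft` has to permute the indices accordingly.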

Theorem 2.3.2 Given an input vector $a = [a_0, a_{K/2}, \ldots, a_{K-1}]$ in bit-reverse
order, Algorithm BackwardFFT replaces it by its backward Fourier transform,
in normal order, in $O(K \log K)$ operations in $R$.

Algorithm 2.3 BackwardFFT
Input: vector a bit-reversed, $\omega$ principal K-th root of unity, $K = 2^k$
Output: in-place transformed vector $\tilde{a}$, normal order
1: if $K = 2$ then
2:     $[a_0, a_1] \leftarrow [a_0 + a_1, a_0 - a_1]$
3: else
4:     $[a_0, \ldots, a_{K/2-1}] \leftarrow$ BackwardFFT($[a_0, \ldots, a_{K/2-1}]$, $\omega^2$, $K/2$)
5:     $[a_{K/2}, \ldots, a_{K-1}] \leftarrow$ BackwardFFT($[a_{K/2}, \ldots, a_{K-1}]$, $\omega^2$, $K/2$)
6:     for j from 0 to $K/2 - 1$ do                ▷ $\omega^{-j} = \omega^{K-j}$
7:         $[a_j, a_{K/2+j}] \leftarrow [a_j + \omega^{-j} a_{K/2+j}, \; a_j - \omega^{-j} a_{K/2+j}]$.



Proof. The complexity bound follows as in the proof of Theorem 2.3.1. For
the correctness result, we again use induction on $K = 2^k$. For $K = 2$ the
backward Fourier transform $\tilde{a} = [a_0 + a_1, a_0 + \omega^{-1} a_1]$ is exactly what the
algorithm returns, since $\omega = \omega^{-1} = -1$ in that case. Assume now $K \ge 4$,
a power of two. The first half, say b, of the vector a corresponds to the
bit-reversed vector of the even indices, since bitrev(2j, K) = bitrev(j, K/2).
Similarly, the second half, say c, corresponds to the bit-reversed vector of
the odd indices, since bitrev(2j + 1, K) = K/2 + bitrev(j, K/2). Thus we can
apply the theorem by induction to b and c. It follows that b is the backward
transform of length $K/2$ with $\omega^2$ for the even indices (in normal order), and
similarly c is the backward transform of length $K/2$ for the odd indices:

$$b_j = \sum_{\ell=0}^{K/2-1} \omega^{-2j\ell} a_{2\ell}, \qquad
  c_j = \sum_{\ell=0}^{K/2-1} \omega^{-2j\ell} a_{2\ell+1}.$$

Since $b_j$ is stored in $a_j$ and $c_j$ in $a_{K/2+j}$, we have:

$$a_j = b_j + \omega^{-j} c_j
  = \sum_{\ell=0}^{K/2-1} \omega^{-2j\ell} a_{2\ell} + \omega^{-j} \sum_{\ell=0}^{K/2-1} \omega^{-2j\ell} a_{2\ell+1}
  = \sum_{\ell=0}^{K-1} \omega^{-j\ell} a_\ell = \tilde{a}_j,$$

and similarly, using $-\omega^{-j} = \omega^{-K/2-j}$ and $\omega^{-2j} = \omega^{-2(K/2+j)}$:

$$a_{K/2+j} = \sum_{\ell=0}^{K/2-1} \omega^{-2j\ell} a_{2\ell} + \omega^{-K/2-j} \sum_{\ell=0}^{K/2-1} \omega^{-2j\ell} a_{2\ell+1}
  = \sum_{\ell=0}^{K-1} \omega^{-(K/2+j)\ell} a_\ell = \tilde{a}_{K/2+j}.$$

□
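
A Python sketch of Algorithm BackwardFFT over $\mathbb{Z}/p\mathbb{Z}$ is given below, under the same assumptions as before ($p = 17$, $\omega = 2$, illustrative function names, and a new list returned instead of a strictly in-place update). The reference function `naive_backward_dft` evaluates the backward transform directly, so that the bit-reversed input convention can be checked.

```python
def bitrev(j, K):
    """Reverse the lg(K) low-order bits of j, for K a power of two."""
    r = 0
    for _ in range(K.bit_length() - 1):
        r = (r << 1) | (j & 1)
        j >>= 1
    return r

def backward_fft(a, w, p):
    """Backward transform modulo p of a vector given in bit-reversed order
    (length K = 2^k >= 2); the result is in normal order."""
    K = len(a)
    if K == 2:
        return [(a[0] + a[1]) % p, (a[0] - a[1]) % p]
    w2 = w * w % p
    lo = backward_fft(a[:K // 2], w2, p)         # step 4: first half
    hi = backward_fft(a[K // 2:], w2, p)         # step 5: second half
    out = [0] * K
    for j in range(K // 2):                      # steps 6-7: butterflies
        t = pow(w, K - j, p) * hi[j] % p         # w^(-j) = w^(K-j)
        out[j] = (lo[j] + t) % p
        out[K // 2 + j] = (lo[j] - t) % p
    return out

def naive_backward_dft(x, w, p):
    """Direct backward transform: y_i = sum_j w^(-i*j) x_j (Python 3.8+)."""
    K = len(x)
    w_inv = pow(w, -1, p)
    return [sum(pow(w_inv, i * j, p) * x[j] for j in range(K)) % p
            for i in range(K)]

p, w, K = 17, 2, 8
x = [3, 1, 4, 1, 5, 9, 2, 6]                     # vector in normal order
x_bitrev = [x[bitrev(j, K)] for j in range(K)]   # same vector, bit-reversed
y = backward_fft(x_bitrev, w, p)
```

Feeding the output of a forward FFT (already bit-reversed) directly into `backward_fft` avoids any explicit permutation, which is the point of pairing the two algorithms.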



2.3.3 The Schonhage-Strassen Algorithm 

We now describe the Schönhage-Strassen $O(n \log n \log\log n)$ algorithm to
multiply two integers of n bits. The heart of the algorithm is a routine to
multiply two integers modulo $2^n + 1$.

Algorithm 2.4 FFTMulMod
Input: $0 \le A, B < 2^n + 1$, an integer $K = 2^k$ such that $n = MK$
Output: $C = A \cdot B \bmod (2^n + 1)$
 1: decompose $A = \sum_{j=0}^{K-1} a_j 2^{jM}$ with $0 \le a_j < 2^M$, except $0 \le a_{K-1} \le 2^M$
 2: decompose $B$ similarly
 3: choose $n' \ge 2n/K + k$, $n'$ multiple of $K$; let $\theta = 2^{n'/K}$, $\omega = \theta^2$
 4: for j from 0 to $K - 1$ do
 5:     $(a_j, b_j) \leftarrow (\theta^j a_j, \theta^j b_j) \bmod (2^{n'} + 1)$
 6: $a \leftarrow$ ForwardFFT($a$, $\omega$, $K$), $b \leftarrow$ ForwardFFT($b$, $\omega$, $K$)
 7: for j from 0 to $K - 1$ do                     ▷ call FFTMulMod
 8:     $c_j \leftarrow a_j b_j \bmod (2^{n'} + 1)$          ▷ recursively if $n'$ is large
 9: $c \leftarrow$ BackwardFFT($c$, $\omega$, $K$)
10: for j from 0 to $K - 1$ do
11:     $c_j \leftarrow c_j/(K\theta^j) \bmod (2^{n'} + 1)$
12:     if $c_j \ge (j + 1) 2^{2M}$ then
13:         $c_j \leftarrow c_j - (2^{n'} + 1)$
14: $C = \sum_{j=0}^{K-1} c_j 2^{jM}$.



Theorem 2.3.3 Given $0 \le A, B < 2^n + 1$, Algorithm FFTMulMod correctly
returns $A \cdot B \bmod (2^n + 1)$, and it costs $O(n \log n \log\log n)$ bit-operations
if $K = \Theta(\sqrt{n})$.






Proof. The proof is by induction on n, because at step 8 we call FFTMulMod
recursively unless $n'$ is sufficiently small that a simpler algorithm
(classical, Karatsuba or Toom-Cook) can be used. There is no difficulty in
starting the induction.

With $a_j, b_j$ the values at steps 1 and 2, we have $A = \sum_{j=0}^{K-1} a_j 2^{jM}$ and
$B = \sum_{j=0}^{K-1} b_j 2^{jM}$, thus $A \cdot B = \sum_{j=0}^{K-1} c_j 2^{jM} \bmod (2^n + 1)$ with

$$c_j = \sum_{\substack{\ell, m = 0 \\ \ell + m = j}}^{K-1} a_\ell b_m
      - \sum_{\substack{\ell, m = 0 \\ \ell + m = K + j}}^{K-1} a_\ell b_m. \qquad (2.2)$$

We have $(j + 1 - K) 2^{2M} \le c_j < (j + 1) 2^{2M}$, since the first sum contains $j + 1$
terms, the second sum $K - (j + 1)$ terms, and at least one of $a_\ell$ and $b_m$ is
less than $2^M$ in the first sum.

Let $a'_j$ be the value of $a_j$ after step 5: $a'_j = \theta^j a_j \bmod (2^{n'} + 1)$, and
similarly for $b'_j$. Using Theorem 2.3.1, after step 6 we have
$a_{\mathrm{bitrev}(j,K)} = \sum_{\ell=0}^{K-1} \omega^{\ell j} a'_\ell \bmod (2^{n'} + 1)$, and similarly for b. Thus at step 8:

$$c_{\mathrm{bitrev}(j,K)} = \left( \sum_{\ell=0}^{K-1} \omega^{\ell j} a'_\ell \right)
                             \left( \sum_{m=0}^{K-1} \omega^{m j} b'_m \right).$$

After step 9, using Theorem 2.3.2:

$$c'_i = \sum_{j=0}^{K-1} \omega^{-ij} \left( \sum_{\ell=0}^{K-1} \omega^{\ell j} a'_\ell \right)
                                       \left( \sum_{m=0}^{K-1} \omega^{m j} b'_m \right)
       = K \sum_{\substack{\ell, m = 0 \\ \ell + m = i}}^{K-1} a'_\ell b'_m
       + K \sum_{\substack{\ell, m = 0 \\ \ell + m = K + i}}^{K-1} a'_\ell b'_m.$$

The first sum equals $\theta^i \sum_{\ell+m=i} a_\ell b_m$; the second is $\theta^{K+i} \sum_{\ell+m=K+i} a_\ell b_m$.
Since $\theta^K = -1 \bmod (2^{n'} + 1)$, after step 11 we have:

$$c_i = \sum_{\substack{\ell, m = 0 \\ \ell + m = i}}^{K-1} a_\ell b_m
      - \sum_{\substack{\ell, m = 0 \\ \ell + m = K + i}}^{K-1} a_\ell b_m \bmod (2^{n'} + 1).$$

The correction at step 13 ensures that $c_i$ lies in the correct interval, as given
by Eqn. (2.2).

For the complexity analysis, assume that $K = \Theta(\sqrt{n})$. Thus we have
$n' = \Theta(\sqrt{n})$. Steps 1 and 2 cost $O(n)$; step 5 also costs $O(n)$ (counting
the cumulated cost for all values of j). Step 6 costs $O(K \log K)$ times the
cost of one butterfly operation mod $(2^{n'} + 1)$, which is $O(n')$, thus a total
of $O(K n' \log K) = O(n \log n)$. Step 8, using the same algorithm recursively,
costs $O(n' \log n' \log\log n')$ per value of j by the induction hypothesis, giving a
total of $O(n \log n \log\log n)$. The backward FFT costs $O(n \log n)$ too, and the
final steps cost $O(n)$, giving a total cost of $O(n \log n \log\log n)$. The $\log\log n$
term is the depth of the recursion, each level reducing n to $n' = O(\sqrt{n})$. □

Example: to multiply two integers modulo $(2^{1\,048\,576} + 1)$, we can take
$K = 2^{10} = 1024$, and $n' = 3072$. We recursively compute 1024 products
modulo $(2^{3072} + 1)$. Alternatively, we can take the smaller value $K = 512$,
with 512 recursive products modulo $(2^{4608} + 1)$.
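
The steps of Algorithm FFTMulMod can be sketched in Python as follows. This is an illustrative transcription under stated assumptions, not a tuned implementation: the pointwise products at step 8 use Python's built-in multiplication instead of a recursive call, multiplications by powers of $\theta$ are done with generic modular arithmetic rather than shifts, and all function names are chosen for the example.

```python
def bitrev(j, K):
    r = 0
    for _ in range(K.bit_length() - 1):
        r, j = (r << 1) | (j & 1), j >> 1
    return r

def forward_fft(a, w, p):
    K = len(a)
    if K == 2:
        return [(a[0] + a[1]) % p, (a[0] - a[1]) % p]
    out = [0] * K
    out[0::2] = forward_fft(a[0::2], w * w % p, p)
    out[1::2] = forward_fft(a[1::2], w * w % p, p)
    for j in range(K // 2):
        t = pow(w, bitrev(j, K // 2), p) * out[2 * j + 1] % p
        out[2 * j], out[2 * j + 1] = (out[2 * j] + t) % p, (out[2 * j] - t) % p
    return out

def backward_fft(a, w, p):
    K = len(a)
    if K == 2:
        return [(a[0] + a[1]) % p, (a[0] - a[1]) % p]
    lo = backward_fft(a[:K // 2], w * w % p, p)
    hi = backward_fft(a[K // 2:], w * w % p, p)
    out = [0] * K
    for j in range(K // 2):
        t = pow(w, K - j, p) * hi[j] % p
        out[j], out[K // 2 + j] = (lo[j] + t) % p, (lo[j] - t) % p
    return out

def fft_mul_mod(A, B, n, k):
    """A*B mod (2^n + 1) for 0 <= A, B <= 2^n, with K = 2^k >= 2 dividing n."""
    K = 1 << k
    M = n // K
    mask = (1 << M) - 1
    def split(X):                                   # steps 1-2
        parts = [(X >> (j * M)) & mask for j in range(K)]
        parts[K - 1] = X >> ((K - 1) * M)           # top part may equal 2^M
        return parts
    n1 = -(-(2 * M + k) // K) * K                   # step 3: n' >= 2n/K + k
    mod = (1 << n1) + 1
    theta = 1 << (n1 // K)                          # theta^K = -1 mod 2^n' + 1
    w = theta * theta % mod
    a = [pow(theta, j, mod) * x % mod for j, x in enumerate(split(A))]  # step 5
    b = [pow(theta, j, mod) * x % mod for j, x in enumerate(split(B))]
    a, b = forward_fft(a, w, mod), forward_fft(b, w, mod)               # step 6
    c = backward_fft([x * y % mod for x, y in zip(a, b)], w, mod)       # steps 7-9
    C = 0
    for j in range(K):                              # steps 10-14
        cj = c[j] * pow(K * pow(theta, j, mod), -1, mod) % mod
        if cj >= (j + 1) << (2 * M):
            cj -= mod
        C += cj << (j * M)
    return C % ((1 << n) + 1)

A, B = 0x123456789ABCDEF0, 0xFEDCBA9876543210
C = fft_mul_mod(A, B, 64, 3)                        # K = 8, M = 8, n' = 24
```

The call above uses toy parameters; realistic sizes such as those of the example would make the recursion at step 8 worthwhile.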

Remark 1: the "small" products at step 8 (mod $(2^{3072} + 1)$ or mod $(2^{4608} + 1)$
in our example) can be performed by the same algorithm applied recursively,
but at some point (determined by details of the implementation) it will be
more efficient to use a simpler algorithm, such as the classical or Karatsuba
algorithm (see §1.3). In practice the depth of recursion is a small constant,
typically 1 or 2. Thus, for practical purposes, the $\log\log n$ term can be
regarded as a constant. For a theoretical way of avoiding the $\log\log n$ term,
see the comments on Fürer's algorithm in §2.9.

Remark 2: if we replace $\theta$ by 1 in Algorithm FFTMulMod, i.e., remove
step 5, replace step 11 by $c_j \leftarrow c_j/K \bmod (2^{n'} + 1)$, and replace the condition
at step 12 by $c_j \ge K \cdot 2^{2M}$, then we compute $C = A \cdot B \bmod (2^n - 1)$ instead of
mod $(2^n + 1)$. This is useful, for example, in McLaughlin's algorithm (§2.4.3).

Algorithm FFTMulMod enables us to multiply two integers modulo
$(2^n + 1)$ in $O(n \log n \log\log n)$ operations, for a suitable n and a corresponding
FFT length $K = 2^k$. Since we should have $K \approx \sqrt{n}$ and $K$ must divide n,
suitable values of n are the integers with the low-order half of their bits zero;
there is no shortage of such integers. To multiply two integers of at most n
bits, we first choose a suitable bit size $m \ge 2n$. We consider the integers as
residues modulo $(2^m + 1)$, then Algorithm FFTMulMod gives their integer
product. The resulting complexity is $O(n \log n \log\log n)$, since $m = O(n)$. In
practice the $\log\log n$ term can be regarded as a constant; theoretically it can
be replaced by an extremely slowly-growing function (see Remark 1 above).

In this book, we sometimes implicitly assume that n-bit integer multiplication
costs the same as three FFTs of length 2n, since this is true if an
FFT-based algorithm is used for multiplication. The constant "three" can
be reduced if some of the FFTs can be precomputed and reused many times,
for example if some of the operands in the multiplications are fixed.

2.4 Modular Multiplication 

Modular multiplication means computing $A \cdot B \bmod N$, where $A$ and $B$ are
residues modulo $N$. Of course, once the product $C = A \cdot B$ has been computed,
it suffices to perform a modular reduction $C \bmod N$, which itself reduces
to an integer division. The reader may ask why we did not cover this
topic in §1.4. There are two reasons. First, the algorithms presented below
benefit from some precomputations involving $N$, and are thus specific to the
case where several reductions are performed with the same modulus. Second,
some algorithms avoid performing the full product $C = A \cdot B$; one such
example is McLaughlin's algorithm (§2.4.3).

Algorithms with precomputations include Barrett's algorithm (§2.4.1),
which computes an approximation to the inverse of the modulus, thus trading
division for multiplication; Montgomery's algorithm, which corresponds to
Hensel's division with remainder only (§1.4.8), and its subquadratic variant,
which is the LSB-variant of Barrett's algorithm; and finally McLaughlin's
algorithm (§2.4.3). The cost of the precomputations is not taken into account:
it is assumed to be negligible if many modular reductions are performed.
However, we assume that the amount of precomputed data uses only linear,
that is $O(\log N)$, space.

As usual, we assume that the modulus $N$ has n words in base $\beta$, that $A$
and $B$ have at most n words, and in some cases that they are fully reduced,
i.e., $0 \le A, B < N$.

2.4.1 Barrett's Algorithm 

Barrett's algorithm is attractive when many divisions have to be made with
the same divisor; this is the case when one performs computations modulo
a fixed integer. The idea is to precompute an approximation to the inverse
of the divisor. Thus, an approximation to the quotient is obtained with just
one multiplication, and the corresponding remainder after a second multiplication.
A small number of corrections suffice to convert the approximations
into exact values. For the sake of simplicity, we describe Barrett's algorithm
in base $\beta$, where $\beta$ might be replaced by any integer, in particular $2^n$ or $\beta^n$.

Algorithm 2.5 BarrettDivRem
Input: integers $A$, $B$ with $0 \le A < \beta^2$, $\beta/2 < B < \beta$
Output: quotient $Q$ and remainder $R$ of $A$ divided by $B$
1: $I \leftarrow \lfloor \beta^2/B \rfloor$                      ▷ precomputation
2: $Q \leftarrow \lfloor A_1 I/\beta \rfloor$ where $A = A_1 \beta + A_0$ with $0 \le A_0 < \beta$
3: $R \leftarrow A - QB$
4: while $R \ge B$ do
5:     $(Q, R) \leftarrow (Q + 1, R - B)$
6: return $(Q, R)$.



Theorem 2.4.1 Algorithm BarrettDivRem is correct and step 5 is performed
at most 3 times.

Proof. Since $A = QB + R$ is invariant in the algorithm, we just need to prove
that $0 \le R < B$ at the end. We first consider the value of $Q$, $R$ before the
while-loop. Since $\beta/2 < B < \beta$, we have $\beta < \beta^2/B < 2\beta$, thus $\beta \le I < 2\beta$.
We have $Q \le A_1 I/\beta \le A_1 \beta/B \le A/B$. This ensures that $R$ is nonnegative.
Now $I > \beta^2/B - 1$, which gives

    $IB > \beta^2 - B$.

Similarly, $Q > A_1 I/\beta - 1$ gives

    $\beta Q > A_1 I - \beta$.

This yields $\beta QB > A_1 IB - \beta B > A_1(\beta^2 - B) - \beta B = \beta(A - A_0) - B(\beta + A_1) > \beta A - 4\beta B$
since $A_0 < \beta < 2B$ and $A_1 < \beta$. We conclude that $A < B(Q + 4)$,
thus at most 3 corrections are needed. □

The bound of 3 corrections is tight: it is attained for $A = 1980$, $B = 36$,
$\beta = 64$. In this example $I = 113$, $A_1 = 30$, $Q = 52$, $R = 108 = 3B$.
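
As an illustration, Algorithm BarrettDivRem can be sketched in Python, whose built-in integers have arbitrary precision. The function name `barrett_divrem` and the extra returned correction count are assumptions made for this sketch; they are not part of the algorithm as stated.

```python
def barrett_divrem(A, B, beta):
    """Sketch of Algorithm BarrettDivRem; also returns the number of corrections."""
    assert 0 <= A < beta**2 and beta < 2 * B and B < beta
    I = beta**2 // B            # step 1: precomputed approximate inverse of B
    A1 = A // beta              # high word of A = A1*beta + A0
    Q = (A1 * I) // beta        # step 2: approximate quotient
    R = A - Q * B               # step 3: corresponding remainder
    corrections = 0             # bookkeeping added for this sketch only
    while R >= B:               # steps 4-5: at most 3 iterations by Theorem 2.4.1
        Q, R = Q + 1, R - B
        corrections += 1
    return Q, R, corrections
```

With the inputs of the tightness example ($A = 1980$, $B = 36$, $\beta = 64$), the sketch starts from $Q = 52$, $R = 108$ and applies three corrections.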

The multiplications at steps 2 and 3 may be replaced by short products,
more precisely the multiplication at step 2 by a high short product, and that
at step 3 by a low short product (see §3.3).

Barrett's algorithm can also be used for an unbalanced division, when
dividing $(k + 1)n$ words by $n$ words for $k \ge 2$, which amounts to $k$ divisions
of $2n$ words by the same $n$-word divisor. In this case, we say that the divisor
is implicitly invariant.

Complexity of Barrett's Algorithm

If the multiplications at steps 2 and 3 are performed using full products,
Barrett's algorithm costs $2M(n)$ for a divisor of size $n$. In the FFT range,
this cost might be lowered to $1.5M(n)$ using the "wrap-around trick" (§3.4.1);
moreover, if the forward transforms of $I$ and $B$ are stored, the cost decreases
to $M(n)$, assuming $M(n)$ is the cost of three FFTs.

2.4.2 Montgomery's Multiplication 

Montgomery's algorithm is very efficient for modular arithmetic modulo
a fixed modulus $N$. The main idea is to replace a residue $A \bmod N$ by
$A' = \lambda A \bmod N$, where $A'$ is the "Montgomery form" corresponding to the
residue $A$, with $\lambda$ an integer constant such that $\gcd(N, \lambda) = 1$. Addition and
subtraction are unchanged, since $\lambda A + \lambda B = \lambda(A + B) \bmod N$. The
multiplication of two residues in Montgomery form does not give exactly what
we want: $(\lambda A)(\lambda B) \ne \lambda(AB) \bmod N$. The trick is to replace the classical
modular multiplication by "Montgomery's multiplication":

    $\mathrm{MontgomeryMul}(A', B') = \frac{A'B'}{\lambda} \bmod N.$

For some values of $\lambda$, MontgomeryMul$(A', B')$ can easily be computed, in
particular for $\lambda = \beta^n$, where $N$ uses $n$ words in base $\beta$. Algorithm 2.6 is a
quadratic algorithm (REDC) to compute MontgomeryMul$(A', B')$ in this
case, and a subquadratic reduction (FastREDC) is given in Algorithm 2.7.
Another view of Montgomery's algorithm for $\lambda = \beta^n$ is to consider that
it computes the remainder of Hensel's division (§1.4.8).

Algorithm 2.6 REDC (quadratic non-interleaved version). The $c_i$ form the
current base-$\beta$ decomposition of $C$, i.e., they are defined by $C = \sum_{i=0}^{2n-1} c_i \beta^i$.
Input: $0 \le C < \beta^{2n}$, $N < \beta^n$, $\mu \leftarrow -N^{-1} \bmod \beta$, $(\beta, N) = 1$
Output: $0 \le R < \beta^n$ such that $R = C\beta^{-n} \bmod N$
1: for $i$ from 0 to $n - 1$ do
2:     $q_i \leftarrow \mu c_i \bmod \beta$    ▷ quotient selection
3:     $C \leftarrow C + q_i N \beta^i$
4: $R \leftarrow C\beta^{-n}$    ▷ trivial exact division
5: if $R \ge \beta^n$ then return $R - N$ else return $R$.

Theorem 2.4.2 Algorithm REDC is correct.

Proof. We first prove that $R = C\beta^{-n} \bmod N$: $C$ is only modified in step 3,
which does not change $C \bmod N$, thus at step 4 we have $R = C\beta^{-n} \bmod N$,
and this remains true in the last step.

Assume that, for a given $i$, we have $C = 0 \bmod \beta^i$ when entering step 2.
Since $q_i = -c_i/N \bmod \beta$, we have $C + q_i N \beta^i = 0 \bmod \beta^{i+1}$ at the next step,
so the next value of $c_i$ is 0. Thus, on exiting the for-loop, $C$ is a multiple of
$\beta^n$, and $R$ is an integer at step 4.

Still at step 4, we have $C < \beta^{2n} + (\beta - 1)N(1 + \beta + \cdots + \beta^{n-1}) =
\beta^{2n} + N(\beta^n - 1)$, thus $R < \beta^n + N$ and $R - N < \beta^n$. □

Compared to classical division (Algorithm BasecaseDivRem, §1.4.1),
Montgomery's algorithm has two significant advantages: the quotient selection
is performed by a multiplication modulo the word base $\beta$, which is more
efficient than a division by the most significant word $b_{n-1}$ of the divisor as
in BasecaseDivRem; and there is no repair step inside the for-loop, since the
repair step is at the very end.

For example, with inputs $C = 766\,970\,544\,842\,443\,844$, $N = 862\,664\,913$,
and $\beta = 1000$, Algorithm REDC precomputes $\mu = 23$; then we have
$q_0 = 412$, which yields $C \leftarrow C + 412N = 766\,970\,900\,260\,388\,000$; then
$q_1 = 924$, which yields $C \leftarrow C + 924N\beta = 767\,768\,002\,640\,000\,000$; then
$q_2 = 720$, which yields $C \leftarrow C + 720N\beta^2 = 1\,388\,886\,740\,000\,000\,000$. At
step 4, $R = 1\,388\,886\,740$, and since $R \ge \beta^3$, REDC returns $R - N =
526\,221\,827$.
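
The worked example can be reproduced with a short Python sketch of Algorithm REDC. The function name `redc` and its argument order are assumptions of this sketch, and the modular inverse is obtained with the three-argument `pow` (Python 3.8 or later) rather than by any method prescribed in the text.

```python
def redc(C, N, beta, n):
    """Sketch of Algorithm REDC: returns R = C * beta^(-n) mod N with 0 <= R < beta^n."""
    assert 0 <= C < beta**(2 * n) and N < beta**n
    mu = (-pow(N, -1, beta)) % beta          # mu = -1/N mod beta (requires gcd(beta, N) = 1)
    for i in range(n):
        c_i = (C // beta**i) % beta          # current word of weight beta^i
        q_i = (mu * c_i) % beta              # step 2: quotient selection
        C += q_i * N * beta**i               # step 3: clears the word of weight beta^i
    R = C // beta**n                         # step 4: exact division
    return R - N if R >= beta**n else R      # step 5: final repair
```

Calling `redc(766970544842443844, 862664913, 1000, 3)` follows the quotients $q_0 = 412$, $q_1 = 924$, $q_2 = 720$ of the example.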

Since Montgomery's algorithm (i.e., Hensel's division with remainder
only) can be viewed as an LSB variant of classical division, Svoboda's
divisor preconditioning (§1.4.2) also translates to the LSB context. More
precisely, in Algorithm REDC, one wants to modify the divisor $N$ so that the
quotient selection $q \leftarrow \mu c_i \bmod \beta$ at step 2 becomes trivial. The multiplier
$k$ used in Svoboda division is simply the parameter $\mu$ in REDC. A natural
choice is $\mu = 1$, which corresponds to $N = -1 \bmod \beta$. This motivates the
Montgomery-Svoboda algorithm, which is as follows:

1. first compute $N' = \mu N$, with $N' < \beta^{n+1}$, where $\mu = -1/N \bmod \beta$;

2. perform the $n - 1$ first loops of REDC, replacing $\mu$ by 1, and $N$ by
$N'$;

3. perform a final classical loop with $\mu$ and $N$, and the last steps (4 and 5)
from REDC.

Quotient selection in the Montgomery-Svoboda algorithm simply involves
"reading" the word of weight $\beta^i$ in the divisor $C$.

For the example above, we get $N' = 19\,841\,292\,999$; $q_0$ is the least
significant word of $C$, i.e., $q_0 = 844$, so $C \leftarrow C + 844N' = 766\,987\,290\,893\,735\,000$;
then $q_1 = 735$ and $C \leftarrow C + 735N'\beta = 781\,570\,641\,248\,000\,000$. The last step
gives $q_2 = 704$ and $C \leftarrow C + 704N\beta^2 = 1\,388\,886\,740\,000\,000\,000$, which is
what we found previously.

Subquadratic Montgomery Reduction 

A subquadratic version FastREDC of Algorithm REDC is obtained by
taking $n = 1$, and considering $\beta$ as a "giant base" (alternatively, replace $\beta$
by $\beta^n$ below):

Algorithm 2.7 FastREDC (subquadratic Montgomery reduction)
Input: $0 \le C < \beta^2$, $N < \beta$, $\mu \leftarrow -1/N \bmod \beta$
Output: $0 \le R < \beta$ such that $R = C/\beta \bmod N$
1: $Q \leftarrow \mu C \bmod \beta$
2: $R \leftarrow (C + QN)/\beta$
3: if $R \ge \beta$ then return $R - N$ else return $R$.

This is exactly the 2-adic counterpart of Barrett's subquadratic algorithm;
steps 1 and 2 might be performed by a low short product and a high short product
respectively.
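
A minimal Python sketch of FastREDC follows, together with a Montgomery multiplication built on it. The names `fast_redc` and `montgomery_mul` are hypothetical, full products are used instead of short products, and $\mu$ is recomputed on each call for brevity although it would normally be precomputed.

```python
def fast_redc(C, N, beta):
    """Sketch of Algorithm FastREDC: returns R = C / beta mod N with 0 <= R < beta."""
    assert 0 <= C < beta**2 and N < beta
    mu = (-pow(N, -1, beta)) % beta       # normally precomputed once per modulus
    Q = (mu * C) % beta                   # step 1 (a low short product in practice)
    R = (C + Q * N) // beta               # step 2: exact division by beta
    return R - N if R >= beta else R      # step 3


def montgomery_mul(A_m, B_m, N, beta):
    """Multiply two residues held in Montgomery form (lambda = beta)."""
    return fast_redc(A_m * B_m, N, beta)
```

To use it, residues are first mapped to Montgomery form with $A' = \beta A \bmod N$, multiplied with `montgomery_mul`, and mapped back with one more call `fast_redc(result, N, beta)`.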

When combined with Karatsuba's multiplication, assuming the products
of steps 1 and 2 are full products, the reduction requires 2 multiplications of
size $n$, i.e., 6 multiplications of size $n/2$ ($n$ denotes the size of $N$, $\beta$ being
a giant base). With some additional precomputation, the reduction might
be performed with 5 multiplications of size $n/2$, assuming $n$ is even. This
is simply the Montgomery-Svoboda algorithm with $N$ having two big words
in base $\beta^{n/2}$. The cost of the algorithm is $M(n, n/2)$ to compute $q_0 N'$ (even
if $N'$ has in principle $3n/2$ words, we know $N' = H\beta^{n/2} - 1$ with $H < \beta^n$,
thus it suffices to multiply $q_0$ by $H$), $M(n/2)$ to compute $\mu C \bmod \beta^{n/2}$, and
again $M(n, n/2)$ to compute $q_1 N$, thus a total of $5M(n/2)$ if each $n \times (n/2)$
product is realized by two $(n/2) \times (n/2)$ products.

Algorithm 2.8 MontgomerySvoboda2
Input: $0 \le C < \beta^{2n}$, $N < \beta^n$, $\mu \leftarrow -1/N \bmod \beta^{n/2}$, $N' = \mu N$
Output: $0 \le R < \beta^n$ such that $R = C/\beta^n \bmod N$
1: $q_0 \leftarrow C \bmod \beta^{n/2}$
2: $C \leftarrow (C + q_0 N')/\beta^{n/2}$
3: $q_1 \leftarrow \mu C \bmod \beta^{n/2}$
4: $R \leftarrow (C + q_1 N)/\beta^{n/2}$
5: if $R \ge \beta^n$ then return $R - N$ else return $R$.



The algorithm is quite similar to the one described at the end of §1.4.6,
where the cost was $3M(n/2) + D(n/2)$ for a division of $2n$ by $n$ with
remainder only. The main difference here is that, thanks to Montgomery's
form, the last classical division $D(n/2)$ in Svoboda's algorithm is replaced
by multiplications of total cost $2M(n/2)$, which is usually faster.

Algorithm MontgomerySvoboda2 can be extended as follows. The
value $C$ obtained after step 2 has $3n/2$ words, i.e., an excess of $n/2$ words.
Instead of reducing that excess with REDC, one could reduce it using
Svoboda's technique with $\mu' = -1/N \bmod \beta^{n/4}$, and $N'' = \mu' N$. This
would reduce the low $n/4$ words from $C$ at the cost of $M(n, n/4)$, and a
last REDC step would reduce the final excess of $n/4$, which would give
$D(2n, n) = M(n, n/2) + M(n, n/4) + M(n/4) + M(n, n/4)$. This "folding"
process can be generalized to $D(2n, n) = M(n, n/2) + \cdots + M(n, n/2^k) +
M(n/2^k) + M(n, n/2^k)$. If $M(n, n/2^k)$ reduces to $2^k M(n/2^k)$, this gives:

    $D(n) = 2M(n/2) + 4M(n/4) + \cdots + 2^{k-1} M(n/2^{k-1}) + (2^{k+1} + 1) M(n/2^k).$

Unfortunately, the resulting multiplications become more and more
unbalanced, and we need to store $k$ precomputed multiples $N', N'', \ldots$ of $N$, each
requiring at least $n$ words. Figure 2.2 shows that the single-folding algorithm
is the best one.

Exercise 2.6 discusses further possible improvements in the Montgomery-
Svoboda algorithm, achieving $D(n) \approx 1.58M(n)$ in the case of Karatsuba
multiplication.

2.4.3 McLaughlin's Algorithm 

McLaughlin's algorithm assumes one can perform fast multiplication modulo
both $2^n - 1$ and $2^n + 1$, for sufficiently many values of $n$. This assumption is
true for example with the Schönhage-Strassen algorithm: the original version
multiplies two numbers modulo $2^n + 1$, but discarding the "twist" operations
before and after the Fourier transforms computes their product modulo $2^n - 1$.
(This has to be done at the top level only: the recursive operations compute
modulo $2^{n'} + 1$ in both cases. See Remark 2 on page 62.)

    Algorithm    Karatsuba    Toom-Cook 3-way    Toom-Cook 4-way
    D(n)         2.00M(n)     2.63M(n)           3.10M(n)
    1-folding    1.67M(n)     1.81M(n)           1.89M(n)
    2-folding    1.67M(n)     1.91M(n)           2.04M(n)
    3-folding    1.74M(n)     2.06M(n)           2.25M(n)

Figure 2.2: Theoretical complexity of subquadratic REDC with 1-, 2- and
3-folding, for different multiplication algorithms.

The key idea in McLaughlin's algorithm is to avoid the classical "multiply
and divide" method for modular multiplication. Instead, assuming that $N$ is
relatively prime to $2^n - 1$, it determines $AB/(2^n - 1) \bmod N$ with convolutions
modulo $2^n \pm 1$, which can be performed in an efficient way using the FFT.

Algorithm 2.9 MultMcLaughlin
Input: $A$, $B$ with $0 \le A, B < N < 2^n$, $\mu = -N^{-1} \bmod (2^n - 1)$
Output: $AB/(2^n - 1) \bmod N$
1: $m \leftarrow AB\mu \bmod (2^n - 1)$
2: $S \leftarrow (AB + mN) \bmod (2^n + 1)$
3: $w \leftarrow -S \bmod (2^n + 1)$
4: if $2 \mid w$ then $s \leftarrow w/2$ else $s \leftarrow (w + 2^n + 1)/2$
5: if $AB + mN = s \bmod 2$ then $t \leftarrow s$ else $t \leftarrow s + 2^n + 1$
6: if $t < N$ then return $t$ else return $t - N$.



Theorem 2.4.3 Algorithm MultMcLaughlin computes $AB/(2^n - 1) \bmod N$
correctly, in $\sim 1.5M(n)$ operations, assuming multiplication modulo $2^n \pm 1$
costs $\sim M(n/2)$, or the same as 3 Fourier transforms of size $n$.

Proof. Step 1 is similar to step 1 of Algorithm FastREDC, with $\beta$ replaced
by $2^n - 1$. It follows that $AB + mN = 0 \bmod (2^n - 1)$, therefore we have
$AB + mN = k(2^n - 1)$ with $0 \le k < 2N$. Step 2 computes $S = -2k \bmod (2^n + 1)$,
then step 3 gives $w = 2k \bmod (2^n + 1)$, and $s = k \bmod (2^n + 1)$ in step 4. Now,
since $0 \le k < 2^{n+1}$, the value $s$ does not uniquely determine $k$, whose missing
bit is determined from the least significant bit from $AB + mN$ (step 5).
Finally, the last step reduces $t = k$ modulo $N$.

The cost of the algorithm is mainly that of the four multiplications
$AB \bmod (2^n \pm 1)$, $(AB)\mu \bmod (2^n - 1)$ and $mN \bmod (2^n + 1)$, which cost
$4M(n/2)$ altogether. However, in $(AB)\mu \bmod (2^n - 1)$ and $mN \bmod (2^n + 1)$,
the operands $\mu$ and $N$ are invariant, therefore their Fourier transforms can
be precomputed, which saves $2M(n/2)/3$ altogether. A further saving of
$M(n/2)/3$ is obtained since we perform only one backward Fourier transform
in step 2. Accounting for the savings gives $(4 - 2/3 - 1/3)M(n/2) =
3M(n/2) \sim 1.5M(n)$. □

The $\sim 1.5M(n)$ cost of McLaughlin's algorithm is quite surprising, since
it means that a modular multiplication can be performed faster than two
multiplications. In other words, since a modular multiplication is basically a
multiplication followed by a division, this means that (at least in this case)
the "division" can be performed for half the cost of a multiplication!
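
The arithmetic of Algorithm MultMcLaughlin can be checked with the following Python sketch. It only illustrates correctness: ordinary integer products stand in for the FFT-based convolutions modulo $2^n \pm 1$, so none of the cost savings appear, and the parity test of step 5 is done on the full value of $AB + mN$ instead of its low bits. The function name `mult_mclaughlin` is an assumption.

```python
def mult_mclaughlin(A, B, N, n):
    """Sketch of Algorithm MultMcLaughlin: returns A*B / (2^n - 1) mod N."""
    minus, plus = 2**n - 1, 2**n + 1
    assert 0 <= A < N and 0 <= B < N and N < 2**n
    mu = (-pow(N, -1, minus)) % minus        # requires gcd(N, 2^n - 1) = 1
    m = (A * B * mu) % minus                 # step 1
    S = (A * B + m * N) % plus               # step 2: S = -2k mod (2^n + 1)
    w = (-S) % plus                          # step 3: w = 2k mod (2^n + 1)
    s = w // 2 if w % 2 == 0 else (w + plus) // 2        # step 4: s = k mod (2^n + 1)
    t = s if (A * B + m * N) % 2 == s % 2 else s + plus  # step 5: recover the missing bit
    return t if t < N else t - N             # step 6
```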

2.4.4 Special Moduli 

For special moduli $N$ faster algorithms may exist. The ideal case is $N =
\beta^n \pm 1$. This is precisely the kind of modulus used in the Schönhage-Strassen
algorithm based on the Fast Fourier Transform (FFT). In the FFT range, a
multiplication modulo $\beta^n \pm 1$ is used to perform the product of two integers
of at most $n/2$ words, and a multiplication modulo $\beta^n \pm 1$ costs $\sim M(n/2) \sim
M(n)/2$.

For example, in elliptic curve cryptography (ECC), one almost always
uses a special modulus, for example a pseudo-Mersenne prime like $2^{192} - 2^{64} - 1$
or $2^{256} - 2^{224} + 2^{192} + 2^{96} - 1$. However, in most applications the modulus
cannot be chosen, and there is no reason for it to have a special form.

We refer to §2.9 for further information about special moduli.

2.5 Modular Division and Inversion 

We have seen above that modular multiplication reduces to integer division,
since to compute $ab \bmod N$, the classical method consists of dividing $ab$ by
$N$ to obtain $ab = qN + r$, then $ab = r \bmod N$. In the same vein, modular
division reduces to an (extended) integer gcd. More precisely, the division
$a/b \bmod N$ is usually computed as $a \cdot (1/b) \bmod N$, thus a modular inverse is
followed by a modular multiplication. We concentrate on modular inversion
in this section.

We have seen in Chapter 1 that computing an extended gcd is expensive,
both for small sizes, where it usually costs the same as several multiplications,
and for large sizes, where it costs $O(M(n) \log n)$. Therefore modular
inversions should be avoided if possible; we explain at the end of this section
how this can be done.

Algorithm 2.10 (ModularInverse) is just Algorithm ExtendedGcd
(§1.6.2), with $(a, b) \to (b, N)$ and the lines computing the cofactors of $N$
omitted.

Algorithm 2.10 ModularInverse
Input: integers $b$ and $N$, $b$ prime to $N$
Output: integer $u = 1/b \bmod N$
$(u, w) \leftarrow (1, 0)$, $c \leftarrow N$
while $c \ne 0$ do
    $(q, r) \leftarrow$ DivRem$(b, c)$
    $(b, c) \leftarrow (c, r)$
    $(u, w) \leftarrow (w, u - qw)$
return $u$.
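
A direct Python transcription of Algorithm ModularInverse is sketched below. The name `modular_inverse` is an assumption, and the final reduction `u % N` is an addition of this sketch: the algorithm as stated returns some integer congruent to $1/b$ modulo $N$, which may be negative.

```python
def modular_inverse(b, N):
    """Sketch of Algorithm ModularInverse for positive b coprime to N."""
    u, w, c = 1, 0, N
    while c != 0:
        q, r = divmod(b, c)      # DivRem(b, c)
        b, c = c, r
        u, w = w, u - q * w      # invariant: b = u*b0 mod N and c = w*b0 mod N
    return u % N                 # normalization to [0, N) added for convenience
```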



Algorithm ModularInverse is the naive version of modular inversion,
with complexity $O(n^2)$ if $N$ takes $n$ words in base $\beta$. The subquadratic
$O(M(n) \log n)$ algorithm is based on the HalfBinaryGcd algorithm (§1.6.3).

When the modulus $N$ has a special form, faster algorithms may exist. In
particular for $N = p^k$, $O(M(n))$ algorithms exist, based on Hensel lifting,
which can be seen as the $p$-adic variant of Newton's method (§4.2). To
compute $1/b \bmod N$, we use a $p$-adic version of the iteration (4.5):

    $x_{j+1} = x_j + x_j(1 - b x_j) \bmod p^k.$    (2.3)

Assume $x_j$ approximates $1/b$ to "$p$-adic precision" $\ell$, i.e., $b x_j = 1 + \varepsilon p^\ell$, and
$k = 2\ell$. Then, modulo $p^k$: $b x_{j+1} = b x_j(2 - b x_j) = (1 + \varepsilon p^\ell)(1 - \varepsilon p^\ell) = 1 - \varepsilon^2 p^{2\ell}$.
Therefore $x_{j+1}$ approximates $1/b$ to double precision (in the $p$-adic sense).

As an example, assume one wants to compute the inverse of an odd integer
$b$ modulo $2^{32}$. The initial approximation $x_0 = 1$ satisfies $x_0 = 1/b \bmod 2$, thus
five iterations are enough. The first iteration is $x_1 \leftarrow x_0 + x_0(1 - b x_0) \bmod 2^2$,
which simplifies to $x_1 \leftarrow 2 - b \bmod 4$ since $x_0 = 1$. Now whether $b = 1 \bmod 4$
or $b = 3 \bmod 4$, we have $2 - b = b \bmod 4$, thus one can immediately start the
second iteration with $x_1 = b$ implicit:

    $x_2 \leftarrow b(2 - b^2) \bmod 2^4$,    $x_3 \leftarrow x_2(2 - b x_2) \bmod 2^8$,
    $x_4 \leftarrow x_3(2 - b x_3) \bmod 2^{16}$,    $x_5 \leftarrow x_4(2 - b x_4) \bmod 2^{32}$.

Consider for example $b = 17$. The above algorithm yields $x_2 = 1$, $x_3 = 241$,
$x_4 = 61\,681$ and $x_5 = 4\,042\,322\,161$. Of course, any computation mod $p^\ell$
might be computed modulo $p^k$ for $k \ge \ell$. In particular, all the above
computations might be performed modulo $2^{32}$. On a 32-bit computer, arithmetic
on basic integer types is usually performed modulo $2^{32}$, thus the reduction
comes for free, and one can write in the C language (using unsigned variables
and the same variable x for $x_2, \ldots, x_5$):

    x = b*(2 - b*b); x *= 2 - b*x; x *= 2 - b*x; x *= 2 - b*x;
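
The same four-step iteration can be sketched in Python, where the reduction modulo $2^{32}$ has to be made explicit with a mask because Python integers do not wrap around. The function name `inverse_mod_2_32` is an assumption of this sketch.

```python
def inverse_mod_2_32(b):
    """Inverse of an odd integer b modulo 2^32 by Hensel (2-adic Newton) lifting."""
    assert b % 2 == 1
    mask = 0xFFFFFFFF                     # reduction modulo 2^32
    x = (b * (2 - b * b)) & mask          # x_2: correct to 4 bits
    for _ in range(3):                    # x_3, x_4, x_5: 8, 16, 32 correct bits
        x = (x * (2 - b * x)) & mask
    return x
```

For $b = 17$ the result is $4\,042\,322\,161$, matching the value of $x_5$ above.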

Another way to perform modular division when the modulus has a special
form is Hensel's division (§1.4.8). For a modulus $N = \beta^n$, given two integers
$A$, $B$, we compute $Q$ and $R$ such that

    $A = QB + R\beta^n.$

Therefore we have $A/B = Q \bmod \beta^n$. While Montgomery's modular
multiplication only computes the remainder $R$ of Hensel's division, modular
division computes the quotient $Q$, thus Hensel's division plays a central role in
modular arithmetic modulo $\beta^n$.

2.5.1 Several Inversions at Once 

A modular inversion, which reduces to an extended gcd (§1.6.2), is usually
much more expensive than a multiplication. This is true not only in the
FFT range, where a gcd takes time $\Theta(M(n) \log n)$, but also for smaller
numbers. When several inversions are to be performed modulo the same number,
Algorithm MultipleInversion is usually faster.

Algorithm 2.11 MultipleInversion
Input: $0 < x_1, \ldots, x_k < N$
Output: $y_1 = 1/x_1 \bmod N, \ldots, y_k = 1/x_k \bmod N$
1: $z_1 \leftarrow x_1$
2: for $i$ from 2 to $k$ do
3:     $z_i \leftarrow z_{i-1} x_i \bmod N$
4: $q \leftarrow 1/z_k \bmod N$
5: for $i$ from $k$ downto 2 do
6:     $y_i \leftarrow q z_{i-1} \bmod N$
7:     $q \leftarrow q x_i \bmod N$
8: $y_1 \leftarrow q$.

Theorem 2.5.1 Algorithm MultipleInversion is correct.

Proof. We have $z_i = x_1 x_2 \cdots x_i \bmod N$, thus at the beginning of step 6 for
a given $i$, $q = (x_1 \cdots x_i)^{-1} \bmod N$, which indeed gives $y_i = 1/x_i \bmod N$. □

This algorithm uses only one modular inversion (step 4), and $3(k - 1)$
modular multiplications. Thus it is faster than $k$ inversions when a modular
inversion is more than three times as expensive as a product. Figure 2.3
shows a recursive variant of the algorithm, with the same number of
modular multiplications: one for each internal node when going up the (product)
tree, and two for each internal node when going down the (remainder) tree.
The recursive variant might be performed in parallel in $O(\log k)$ operations
using $O(k/\log k)$ processors.
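
The iterative Algorithm MultipleInversion can be sketched in Python as follows. The function name `multiple_inversion` is an assumption, and the single inversion of step 4 is delegated to the built-in three-argument `pow` (Python 3.8 or later) purely for brevity.

```python
def multiple_inversion(xs, N):
    """Sketch of Algorithm MultipleInversion: inverts every x in xs modulo N."""
    k = len(xs)
    z = [xs[0] % N]                        # step 1
    for i in range(1, k):                  # steps 2-3: prefix products
        z.append(z[-1] * xs[i] % N)
    q = pow(z[-1], -1, N)                  # step 4: the only modular inversion
    ys = [0] * k
    for i in range(k - 1, 0, -1):          # steps 5-7
        ys[i] = q * z[i - 1] % N           # q = 1/(x_1 ... x_i) at this point
        q = q * xs[i] % N
    ys[0] = q                              # step 8
    return ys
```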




[Figure 2.3 (tree diagram not reproduced): the leaves of the tree are
$1/x_1$, $1/x_2$, $1/x_3$, $1/x_4$.]

Figure 2.3: A recursive variant of Algorithm MultipleInversion. First
go up the tree, building $x_1 x_2 \bmod N$ from $x_1$ and $x_2$ in the left branch,
$x_3 x_4 \bmod N$ in the right branch, and $x_1 x_2 x_3 x_4 \bmod N$ at the root of the
tree. Then invert the root of the tree. Finally go down the tree, multiplying
$1/(x_1 x_2 x_3 x_4)$ by the stored value $x_3 x_4$ to get $1/(x_1 x_2)$, and so on.

A dual case is when there are several moduli but the number to invert
is fixed. Say we want to compute $1/x \bmod N_1, \ldots, 1/x \bmod N_k$. We
illustrate a possible algorithm in the case $k = 4$. First compute $N = N_1 \cdots N_k$
using a product tree like that in Figure 2.3, for example first compute $N_1 N_2$
and $N_3 N_4$, then multiply both to get $N = (N_1 N_2)(N_3 N_4)$. Then compute
$y = 1/x \bmod N$, and go down the tree, while reducing the residue at each
node. In our example we compute $z = y \bmod (N_1 N_2)$ in the left branch,
then $z \bmod N_1$ yields $1/x \bmod N_1$. An important difference between this
algorithm and the algorithm illustrated in Figure 2.3 is that here, the numbers
grow while going up the tree. Thus, depending on the sizes of $x$ and the $N_j$,
this algorithm might be of theoretical interest only.

2.6 Modular Exponentiation 

Modular exponentiation is the most time-consuming mathematical opera- 
tion in several cryptographic algorithms. The well-known RSA public-key 
cryptosystem is based on the fact that computing 

    $c = a^e \bmod N$    (2.4)

is relatively easy, but recovering $a$ from $c$, $e$ and $N$ is difficult when $N$ has at
least two (unknown) large prime factors. The discrete logarithm problem is
similar: here $c$, $a$ and $N$ are given, and one looks for $e$ satisfying Eqn. (2.4).
In this case the problem is difficult when $N$ has at least one large prime
factor (for example, $N$ could be prime). The discrete logarithm problem is
the basis of the El Gamal cryptosystem, and a closely related problem is the
basis of the Diffie-Hellman key exchange protocol.

When the exponent e is fixed (or known to be small), an optimal sequence 
of squarings and multiplications might be computed in advance. This is 
related to the classical addition chain problem: What is the smallest chain 
of additions to reach the integer e, starting from 1? For example, if e = 15, 
a possible chain is: 

1, 1 + 1 = 2, 1 + 2 = 3, 1 + 3 = 4, 3 + 4 = 7, 7 + 7 = 14, 1 + 14 = 15. 

The length of a chain is defined to be the number of additions needed to com- 
pute it (the above chain has length 6). An addition chain readily translates 
to a multiplication chain: 

    $a,\ a \cdot a = a^2,\ a \cdot a^2 = a^3,\ a \cdot a^3 = a^4,\ a^3 \cdot a^4 = a^7,\ a^7 \cdot a^7 = a^{14},\ a \cdot a^{14} = a^{15}.$

A shorter chain for $e = 15$ is:

    1, 1 + 1 = 2, 1 + 2 = 3, 2 + 3 = 5, 5 + 5 = 10, 5 + 10 = 15.

This chain is the shortest possible for $e = 15$, so we write $\sigma(15) = 5$, where
in general $\sigma(e)$ denotes the length of the shortest addition chain for $e$. In
the case where $e$ is small, and an addition chain of shortest length $\sigma(e)$
is known for $e$, computing $a^e \bmod N$ may be performed in $\sigma(e)$ modular
multiplications.
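
To make the translation from an addition chain to a multiplication chain concrete, here is a small Python sketch. The encoding of a chain as a list of index pairs and the name `power_by_chain` are assumptions chosen for this illustration.

```python
def power_by_chain(a, steps, N):
    """Evaluate a^e mod N along an addition chain.

    steps is a list of index pairs (i, j): each step appends the sum of the
    chain elements of indices i and j (element 0 is the initial value 1).
    Returns (e, a^e mod N, number of modular multiplications).
    """
    exponents = [1]
    powers = [a % N]
    for i, j in steps:
        exponents.append(exponents[i] + exponents[j])
        powers.append(powers[i] * powers[j] % N)   # one multiplication per addition
    return exponents[-1], powers[-1], len(steps)


# The chain 1, 2, 3, 5, 10, 15 of length 5 given above.
CHAIN_15 = [(0, 0), (0, 1), (1, 2), (3, 3), (3, 4)]
```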

When $e$ is large and $(a, N) = 1$, then $e$ might be reduced modulo $\phi(N)$,
where $\phi(N)$ is Euler's totient function, i.e., the number of integers in $[1, N]$
which are relatively prime to $N$. This is because $a^{\phi(N)} = 1 \bmod N$ whenever
$(a, N) = 1$ (Fermat's little theorem, in the generalized form due to Euler).

Since $\phi(N)$ is a multiplicative function, it is easy to compute $\phi(N)$ if we
know the prime factorisation of $N$. For example,

    $\phi(1001) = \phi(7 \cdot 11 \cdot 13) = (7 - 1)(11 - 1)(13 - 1) = 720,$

and $2009 = 569 \bmod 720$, so $17^{2009} = 17^{569} \bmod 1001$.
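
The exponent reduction in this example can be checked with a few lines of Python. The helper `reduce_exponent` is hypothetical and, as a simplifying assumption, handles only a squarefree modulus given by its distinct prime factors.

```python
from math import gcd


def reduce_exponent(a, e, distinct_primes):
    """Reduce e modulo phi(N), where N is the product of the given distinct primes."""
    N, phi = 1, 1
    for p in distinct_primes:
        N *= p
        phi *= p - 1               # phi is multiplicative and phi(p) = p - 1
    assert gcd(a, N) == 1          # the reduction is only valid when (a, N) = 1
    return N, phi, e % phi
```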

Assume now that $e$ is smaller than $\phi(N)$. Since a lower bound on the
length $\sigma(e)$ of the addition chain for $e$ is $\lg e$, this yields a lower bound
$(\lg e) M(n)$ for modular exponentiation, where $n$ is the size of $N$. When $e$
is of size $k$, a modular exponentiation costs $O(k M(n))$. For $k = n$, the cost
$O(n M(n))$ of modular exponentiation is much more than the cost of
operations considered in Chapter 1, with $O(M(n) \log n)$ for the more expensive
ones there. The different algorithms presented in this section save only a
constant factor compared to binary exponentiation (§2.6.1).

Remark: when $a$ fits in one word but $N$ does not, the shortest addition
chain for $e$ might not be the best way to compute $a^e \bmod N$, since in this
case computing $a \cdot a^{e-1} \bmod N$ is cheaper than computing $a^i \cdot a^{e-i} \bmod N$ for
$i \ge 2$.

2.6.1 Binary Exponentiation 

A simple (and not far from optimal) algorithm for modular exponentiation is 
binary (modular) exponentiation. Two variants exist: left-to-right and right- 
to-left. We give the former in Algorithm LeftToRightBinaryExp and leave 
the latter as an exercise for the reader. 

Left-to-right binary exponentiation has two advantages over right-to-left
exponentiation:

• it requires only one auxiliary variable, instead of two for the right-to-
left exponentiation: one to store successive values of $a^{2^i}$, and one to
store the result;

• in the case where $a$ is small, the multiplications $ax$ at step 5 always
involve a small operand.

Algorithm 2.12 LeftToRightBinaryExp
Input: $a$, $e$, $N$ positive integers
Output: $x = a^e \bmod N$
1: let $(e_\ell e_{\ell-1} \ldots e_1 e_0)$ be the binary representation of $e$, with $e_\ell = 1$
2: $x \leftarrow a$
3: for $i$ from $\ell - 1$ downto 0 do
4:     $x \leftarrow x^2 \bmod N$
5:     if $e_i = 1$ then $x \leftarrow ax \bmod N$.

If $e$ is a random integer of $\ell + 1$ bits, step 5 will be performed on average $\ell/2$
times, giving average cost $3\ell M(n)/2$.

Example: for the exponent $e = 3\,499\,211\,612$, which is

    $(11\,010\,000\,100\,100\,011\,011\,101\,101\,011\,100)_2$

in binary, Algorithm LeftToRightBinaryExp performs 31 squarings and
15 multiplications (one for each 1-bit, except the most significant one).
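
The operation counts of this example can be reproduced with the following Python sketch of Algorithm LeftToRightBinaryExp. The function name and the two counters returned alongside the result are assumptions added for illustration.

```python
def left_to_right_binary_exp(a, e, N):
    """Sketch of LeftToRightBinaryExp; returns (a^e mod N, squarings, multiplications)."""
    bits = bin(e)[2:]                  # binary digits, most significant first
    x = a % N                          # step 2: accounts for the leading 1-bit
    squarings = multiplications = 0
    for bit in bits[1:]:               # step 3: remaining bits, left to right
        x = x * x % N                  # step 4
        squarings += 1
        if bit == "1":                 # step 5
            x = a * x % N
            multiplications += 1
    return x, squarings, multiplications
```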

2.6.2 Exponentiation With a Larger Base 

Compared to binary exponentiation, base $2^k$ exponentiation reduces the
number of multiplications $ax \bmod N$ (Algorithm LeftToRightBinaryExp,
step 5). The idea is to precompute small powers of $a \bmod N$:

Algorithm 2.13 BaseKExp
Input: $a$, $e$, $N$ positive integers
Output: $x = a^e \bmod N$
1: precompute $t[i] := a^i \bmod N$ for $1 \le i < 2^k$
2: let $(e_\ell e_{\ell-1} \ldots e_1 e_0)$ be the base $2^k$ representation of $e$, with $e_\ell \ne 0$
3: $x \leftarrow t[e_\ell]$
4: for $i$ from $\ell - 1$ downto 0 do
5:     $x \leftarrow x^{2^k} \bmod N$
6:     if $e_i \ne 0$ then $x \leftarrow t[e_i] x \bmod N$.

The precomputation cost is $(2^k - 2)M(n)$, and if the digits $e_i$ are random
and uniformly distributed in $\mathbb{Z} \cap [0, 2^k)$, then the modular multiplication at
step 6 of BaseKExp is performed with probability $1 - 2^{-k}$. If $e$ has $n$ bits,
the number of loops is about $n/k$. Ignoring the squares at step 5, whose total
cost depends on $k\ell \approx n$ (independent of $k$), the total expected cost in terms
of multiplications modulo $N$ is:

    $2^k - 2 + n(1 - 2^{-k})/k.$

For $k = 1$ this formula gives $n/2$; for $k = 2$ it gives $3n/8 + 2$, which is faster
for $n > 16$; for $k = 3$ it gives $7n/24 + 6$, which is faster than the $k = 2$ formula
for $n > 48$. When $n$ is large, the optimal value of $k$ satisfies $k^2 2^k \approx n/\ln 2$.
A minor disadvantage of this algorithm is its memory usage, since $\Theta(2^k)$
precomputed entries have to be stored. This is not a serious problem if we
choose the optimal value of $k$ (or a smaller value), because then the number
of precomputed entries to be stored is $o(n)$.

Example: consider the exponent $e = 3\,499\,211\,612$. Algorithm BaseKExp
performs 31 squarings independently of $k$, thus we count multiplications only.
For $k = 2$, we have $e = (3\,100\,210\,123\,231\,130)_4$: Algorithm BaseKExp
performs two multiplications to precompute $a^2$ and $a^3$, and 11 multiplications
for the non-zero digits of $e$ in base 4 (except for the leading digit), thus a total
of 13. For $k = 3$, we have $e = (32\,044\,335\,534)_8$, and the algorithm performs 6
multiplications to precompute $a^2, a^3, \ldots, a^7$, and 9 multiplications in step 6,
thus a total of 15.
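
The multiplication counts of the example can be checked with this Python sketch of Algorithm BaseKExp. The function name `base_k_exp` and the returned counter are assumptions; squarings are performed but deliberately not counted, as in the discussion above.

```python
def base_k_exp(a, e, N, k):
    """Sketch of BaseKExp; returns (a^e mod N, multiplications excluding squarings)."""
    assert e > 0
    t = [1, a % N]                      # t[i] = a^i mod N (t[0] is never used)
    mults = 0
    for _ in range(2, 2**k):            # step 1: 2^k - 2 precomputation products
        t.append(t[-1] * a % N)
        mults += 1
    digits = []                         # step 2: base-2^k digits, least significant first
    while e > 0:
        digits.append(e % 2**k)
        e //= 2**k
    digits.reverse()
    x = t[digits[0]]                    # step 3
    for d in digits[1:]:                # step 4
        for _ in range(k):              # step 5: x <- x^(2^k) by k squarings
            x = x * x % N
        if d != 0:                      # step 6
            x = t[d] * x % N
            mults += 1
    return x, mults
```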

The last example illustrates two facts. First, if some digits (here 6 and 7)
do not appear in the base-$2^k$ representation of $e$, then we do not need to
precompute the corresponding powers of $a$. Second, when a digit is even, say
$e_i = 2$, instead of doing three squarings and multiplying by $a^2$, we could do
two squarings, multiply by $a$, and perform a last squaring. These
considerations lead to Algorithm BaseKExpOdd.

Algorithm 2.14 BaseKExpOdd
Input: $a$, $e$, $N$ positive integers
Output: $x = a^e \bmod N$
1: precompute $a^2$ then $t[i] := a^i \bmod N$ for $i$ odd, $1 \le i < 2^k$
2: let $(e_\ell e_{\ell-1} \ldots e_1 e_0)$ be the base $2^k$ representation of $e$, with $e_\ell \ne 0$
3: write $e_\ell = 2^m d$ with $d$ odd
4: $x \leftarrow t[d]$, $x \leftarrow x^{2^m} \bmod N$
5: for $i$ from $\ell - 1$ downto 0 do
6:     write $e_i = 2^m d$ with $d$ odd (if $e_i = 0$ then $m = d = 0$)
7:     $x \leftarrow x^{2^{k-m}} \bmod N$
8:     if $e_i \ne 0$ then $x \leftarrow t[d] x \bmod N$
9:     $x \leftarrow x^{2^m} \bmod N$.

The correctness of steps 7 to 9 follows from:

    $x^{2^k} a^{2^m d} = (x^{2^{k-m}} a^d)^{2^m}.$

On the previous example, with $k = 3$, this algorithm performs only four
multiplications in step 1 (to precompute $a^2$, then $a^3$, $a^5$, $a^7$), then nine
multiplications in step 8.

2.6.3 Sliding Window and Redundant Representation

The "sliding window" algorithm is a straightforward generalization of
Algorithm BaseKExpOdd. Instead of cutting the exponent into fixed parts of $k$
bits each, the idea is to divide it into windows, where two adjacent windows
might be separated by a block of zero or more 0-bits. The decomposition
starts from the least significant bits. For example, with $e = 3\,499\,211\,612$, or
in binary:

    $\underbrace{1}_{e_8}\ \underbrace{101}_{e_7}\ 00\ \underbrace{001}_{e_6}\ \underbrace{001}_{e_5}\ 00\ \underbrace{011}_{e_4}\ \underbrace{011}_{e_3}\ \underbrace{101}_{e_2}\ \underbrace{101}_{e_1}\ 0\ \underbrace{111}_{e_0}\ 00.$

Here there are 9 windows (indicated by $e_8, \ldots, e_0$ above) and we perform
only 8 multiplications, an improvement of one multiplication over Algorithm
BaseKExpOdd. On average, the sliding window base $2^k$ algorithm leads
to about $n/(k + 1)$ windows instead of $n/k$ with fixed windows.

Another improvement may be feasible when division is feasible (and
cheap) in the underlying group. For example, if we encounter three
consecutive ones, say 111, in the binary representation of $e$, we may replace some bits
by $-1$, denoted by $\bar{1}$, as in $100\bar{1}$. We have thus replaced three multiplications
by one multiplication and one division, in other words $x^7 = x^8 \cdot x^{-1}$. For our
running example, this gives:

    $e = 11\ 010\ 000\ 100\ 100\ 100\ \bar{1}00\ 0\bar{1}0\ 0\bar{1}0\ \bar{1}00\ \bar{1}00,$

which has only 10 non-zero digits, apart from the leading one, instead of 15
with bits 0 and 1 only. The redundant representation with bits $\{0, 1, \bar{1}\}$ is
called the Booth representation. It is a special case of the Avizienis signed-
digit redundant representation. Signed-digit representations exist in any base.
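
The sliding-window decomposition described in this section can be sketched in Python as follows. Both function names are assumptions of this sketch, windows are represented as hypothetical (bit position, odd value) pairs, and the multiplication counter excludes the precomputation of the odd powers.

```python
def sliding_windows(e, k):
    """Split e into odd windows of at most k bits, scanning from the least significant bit."""
    windows, pos = [], 0
    while e >> pos:
        if (e >> pos) & 1 == 0:
            pos += 1                                  # skip a separating 0-bit
        else:
            windows.append((pos, (e >> pos) & (2**k - 1)))
            pos += k
    return windows


def sliding_window_exp(a, e, N, k):
    """Returns (a^e mod N, number of windows, multiplications in the main loop)."""
    a %= N
    a2 = a * a % N
    t = {1: a}
    for i in range(3, 2**k, 2):                       # odd powers a^3, a^5, ...
        t[i] = t[i - 2] * a2 % N
    windows = sliding_windows(e, k)
    pos, value = windows[-1]                          # most significant window
    x, mults = t[value], 0
    for next_pos, next_value in reversed(windows[:-1]):
        for _ in range(pos - next_pos):               # squarings down to the next window
            x = x * x % N
        x = t[next_value] * x % N
        mults += 1
        pos = next_pos
    for _ in range(pos):                              # trailing 0-bits
        x = x * x % N
    return x, len(windows), mults
```

For the running example with $k = 3$, the sketch finds the 9 windows shown above and performs 8 multiplications in its main loop.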

For simplicity we have not distinguished between the cost of multiplication
and the cost of squaring (when the two operands in the multiplication
are known to be equal), but this distinction is significant in some applications
(e.g., elliptic curve cryptography). Note that, when the underlying group
operation is denoted by addition rather than multiplication, as is usually the
case for abelian groups (such as groups defined over elliptic curves), then
the discussion above applies with "multiplication" replaced by "addition",
"division" by "subtraction", and "squaring" by "doubling".

2.7 Chinese Remainder Theorem 

In applications where integer or rational results are expected, it is often
worthwhile to use a "residue number system" (as in §2.1.3) and perform all
computations modulo several small primes (or pairwise coprime integers).
The final result can then be recovered via the Chinese Remainder Theorem
(CRT). For such applications, it is important to have fast conversion routines
from integer to modular representation, and vice versa.

The integer to modular conversion problem is the following: given an
integer $x$, and several pairwise coprime moduli $m_i$, $1 \le i \le k$, how to
efficiently compute $x_i = x \bmod m_i$, for $1 \le i \le k$? This is the remainder tree
problem of Algorithm IntegerToRNS, which is also discussed in §2.5.1 and
Exercise 1.35.

Algorithm 2.15 IntegerToRNS
Input: integer $x$, moduli $m_1, m_2, \ldots, m_k$ pairwise coprime, $k \ge 1$
Output: $x_i = x \bmod m_i$ for $1 \le i \le k$
1: if $k \le 2$ then
2:     return $x_1 = x \bmod m_1, \ldots, x_k = x \bmod m_k$
3: $\ell \leftarrow \lfloor k/2 \rfloor$
4: $M_1 \leftarrow m_1 m_2 \cdots m_\ell$, $M_2 \leftarrow m_{\ell+1} \cdots m_k$    ▷ might be precomputed
5: $x_1, \ldots, x_\ell \leftarrow$ IntegerToRNS$(x \bmod M_1, m_1, \ldots, m_\ell)$
6: $x_{\ell+1}, \ldots, x_k \leftarrow$ IntegerToRNS$(x \bmod M_2, m_{\ell+1}, \ldots, m_k)$.
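
A recursive Python sketch of Algorithm IntegerToRNS is given below. The function name `integer_to_rns` is an assumption, and the partial products $M_1$, $M_2$ are recomputed on every call with `math.prod` (Python 3.8 or later) rather than precomputed in a product tree.

```python
from math import prod


def integer_to_rns(x, moduli):
    """Sketch of Algorithm IntegerToRNS: returns [x mod m for m in moduli]."""
    k = len(moduli)
    if k <= 2:                                        # steps 1-2: base case
        return [x % m for m in moduli]
    half = k // 2                                     # step 3
    M1, M2 = prod(moduli[:half]), prod(moduli[half:])   # step 4 (might be precomputed)
    return (integer_to_rns(x % M1, moduli[:half])     # step 5
            + integer_to_rns(x % M2, moduli[half:]))  # step 6
```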



If all moduli rrii have the same size, and if the size n of x is com- 
parable to that of the product mim2---mk, the cost T{k) of Algorithm 
IntegerToRNS satisfies the recurrence T(n) = 2D(n/2) + 2T(ra/2), which 
yields T{n) = 0{M{n) log?7,). Such a conversion is therefore more expensive 
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than a multiplication or division, and is comparable in complexity terms to 
a base conversion or a gcd. 

The converse CRT reconstruction problem is the following: given the Xi, 
how to efficiently reconstruct the unique integer x, < x < mim2 ■ ■ -mk, 
such that X = Xi mod m^, for 1 < i < kl Algorithm RNSToInteger per- 
forms that conversion, where the values u,v at step [7] might be precomputed 
if several conversions are made with the same moduli, and step [11] ensures 
that the final result x lies in the interval [0, M1M2). 

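Algorithm 2.16 RNSToInteger

Input: residues $x_i$, $0 \le x_i < m_i$ for $1 \le i \le k$, $m_i$ pairwise coprime
Output: $0 \le x < m_1 m_2 \cdots m_k$ with $x \equiv x_i \bmod m_i$
1: if $k = 1$ then
2:     return $x_1$
3: $\ell \leftarrow \lfloor k/2 \rfloor$
4: $M_1 \leftarrow m_1 m_2 \cdots m_\ell$, $M_2 \leftarrow m_{\ell+1} \cdots m_k$    ▷ might be precomputed
5: $X_1 \leftarrow$ RNSToInteger($[x_1, \ldots, x_\ell], [m_1, \ldots, m_\ell]$)
6: $X_2 \leftarrow$ RNSToInteger($[x_{\ell+1}, \ldots, x_k], [m_{\ell+1}, \ldots, m_k]$)
7: compute $u, v$ such that $u M_1 + v M_2 = 1$    ▷ might be precomputed
8: $\lambda_1 \leftarrow u X_2 \bmod M_2$, $\lambda_2 \leftarrow v X_1 \bmod M_1$
9: $x \leftarrow \lambda_1 M_1 + \lambda_2 M_2$
10: if $x \ge M_1 M_2$ then
11:     $x \leftarrow x - M_1 M_2$.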
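A possible Python rendering of Algorithm RNSToInteger is sketched below. The function name is illustrative. As a simplification that is not in the algorithm text, step 7 is replaced by two modular inverses: since only $u \bmod M_2$ and $v \bmod M_1$ are used in step 8, these suffice. The sketch assumes Python 3.8 or later for `pow(a, -1, m)` and `math.prod`.

```python
from math import prod


def rns_to_integer(residues, moduli):
    """Reconstruct 0 <= x < prod(moduli) with x = residues[i] mod moduli[i]."""
    k = len(moduli)
    if k == 1:
        return residues[0]
    half = k // 2
    m1, m2 = prod(moduli[:half]), prod(moduli[half:])
    x1 = rns_to_integer(residues[:half], moduli[:half])
    x2 = rns_to_integer(residues[half:], moduli[half:])
    # Step 7: only u mod M2 and v mod M1 are needed (could be precomputed).
    u = pow(m1, -1, m2)   # u * M1 = 1 mod M2
    v = pow(m2, -1, m1)   # v * M2 = 1 mod M1
    lam1 = u * x2 % m2
    lam2 = v * x1 % m1
    x = lam1 * m1 + lam2 * m2   # x < 2 * M1 * M2
    if x >= m1 * m2:
        x -= m1 * m2
    return x


print(rns_to_integer([2, 3, 4], [11, 13, 17]))  # 497
```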



To see that Algorithm RNSToInteger is correct, consider an integer $i$,
$1 \le i \le k$, and show that $x \equiv x_i \bmod m_i$. If $k = 1$, it is trivial. Assume
$k \ge 2$, and without loss of generality $1 \le i \le \ell$. Since $M_1$ is a multiple of
$m_i$, we have $x \bmod m_i = (x \bmod M_1) \bmod m_i$, where

$$x \bmod M_1 = \lambda_2 M_2 \bmod M_1 = v X_1 M_2 \bmod M_1 = X_1 \bmod M_1,$$

and the result follows from the induction hypothesis that $X_1 \equiv x_i \bmod m_i$.

Like IntegerToRNS, Algorithm RNSToInteger costs $O(M(n) \log n)$
for $M = m_1 m_2 \cdots m_k$ of size $n$, assuming that the $m_i$ are of equal sizes.

The CRT reconstruction problem is analogous to the Lagrange polynomial
interpolation problem: find a polynomial of minimal degree interpolating
given values $x_i$ at $k$ points $m_i$.

A "flat" variant of the explicit Chinese remainder reconstruction is the
following, taking for example $k = 3$:

$$x = \lambda_1 x_1 + \lambda_2 x_2 + \lambda_3 x_3,$$

where $\lambda_i \equiv 1 \bmod m_i$, and $\lambda_i \equiv 0 \bmod m_j$ for $j \ne i$. In other words, $\lambda_i$ is
the reconstruction of $x_1 = 0, \ldots, x_{i-1} = 0, x_i = 1, x_{i+1} = 0, \ldots, x_k = 0$. For
example, with $m_1 = 11$, $m_2 = 13$ and $m_3 = 17$ we get:

$$x = 221 x_1 + 1496 x_2 + 715 x_3.$$

To reconstruct the integer corresponding to $x_1 = 2$, $x_2 = 3$, $x_3 = 4$, we
get $x = 221 \cdot 2 + 1496 \cdot 3 + 715 \cdot 4 = 7790$, which after reduction modulo
$11 \cdot 13 \cdot 17 = 2431$ gives 497.
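The coefficients of the flat variant can be recomputed with a short Python sketch. The helper names `crt_coefficients` and `flat_crt` are illustrative assumptions; each $\lambda_i$ is built as the cofactor $M/m_i$ times its inverse modulo $m_i$ (Python 3.8 or later).

```python
from math import prod


def crt_coefficients(moduli):
    """Return lambda_i with lambda_i = 1 mod m_i and 0 mod m_j for j != i."""
    total = prod(moduli)
    coeffs = []
    for m in moduli:
        cofactor = total // m            # divisible by every m_j, j != i
        coeffs.append(cofactor * pow(cofactor, -1, m))
    return coeffs


def flat_crt(residues, moduli):
    """Explicit CRT: sum of lambda_i * x_i, reduced modulo the product."""
    coeffs = crt_coefficients(moduli)
    return sum(c * r for c, r in zip(coeffs, residues)) % prod(moduli)


print(crt_coefficients([11, 13, 17]))     # [221, 1496, 715]
print(flat_crt([2, 3, 4], [11, 13, 17]))  # 497
```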

2.8 Exercises 

Exercise 2.1 In §2.1.3 we considered the representation of nonnegative integers
using a residue number system. Show that a residue number system can also
be used to represent signed integers, provided their absolute values are not too
large. (Specifically, if relatively prime moduli $m_1, m_2, \ldots, m_k$ are used, and
$B = m_1 m_2 \cdots m_k$, the integers $x$ should satisfy $|x| < B/2$.)

Exercise 2.2 Suppose two nonnegative integers $x$ and $y$ are represented by their
residues modulo a set of relatively prime moduli $m_1, m_2, \ldots, m_k$ as in §2.1.3.
Consider the comparison problem: is $x < y$? Is it necessary to convert $x$ and $y$
back to a standard (non-CRT) representation in order to answer this question?
Similarly, if a signed integer $x$ is represented as in Exercise 2.1, consider the
sign detection problem: is $x < 0$?

Exercise 2.3 Consider the use of redundant moduli in the Chinese remainder 
representation. In other words, using the notation of Exercise 2.2, consider the
case that x could be reconstructed without using all the residues. Show that this 
could be useful for error detection (and possibly error correction) if arithmetic 
operations are performed on unreliable hardware. 

Exercise 2.4 Consider the two complexity bounds $O(M(d \log(Nd)))$ and
$O(M(d) M(\log N))$ given at the end of §2.1.5. Compare the bounds in three cases:
(a) $d \ll N$; (b) $d \sim N$; (c) $d \gg N$. Assume two subcases for the multiplication
algorithm: (i) $M(n) = O(n^2)$; (ii) $M(n) = O(n \log n)$. (For the sake of simplicity,
ignore any log log factors.)

Exercise 2.5 Show that, if a symmetric representation in $[-N/2, N/2)$ is used
in Algorithm ModularAdd (§2.2), then the probability that we need to add or
subtract $N$ is $1/4$ if $N$ is even, and $(1 - 1/N^2)/4$ if $N$ is odd (assuming in both
cases that $a$ and $b$ are uniformly distributed).

Exercise 2.6 Write down the complexity of the Montgomery-Svoboda algorithm
(§2.4.2, page 67) for $k$ steps. For $k = 3$, use van der Hoeven's relaxed Karatsuba
multiplication [124] to save one $M(n/3)$ product.

Exercise 2.7 Assume you have an FFT algorithm computing products modulo
$2^n + 1$. Prove that, with some preconditioning, you can perform a division with
remainder of a $2n$-bit integer by an $n$-bit integer as fast as 1.5 multiplications of
$n$ bits by $n$ bits.

Exercise 2.8 Assume you know $p(x) \bmod (x^{n_1} - 1)$ and $p(x) \bmod (x^{n_2} - 1)$, where
$p(x) \in F[x]$ has degree $n - 1$, and $n_1 > n_2$, and $F$ is a field. Up to which value of
$n$ can you uniquely reconstruct $p$? Design a corresponding algorithm.

Exercise 2.9 Consider the problem of computing the Fourier transform of a vector
$a = [a_0, a_1, \ldots, a_{K-1}]$, defined in Eqn. (2.1), when the size $K$ is not a power
of two. For example, $K$ might be an odd prime or an odd prime power. Can you
find an algorithm to do this in $O(K \log K)$ operations?

Exercise 2.10 Consider the problem of computing the cyclic convolution of two
$K$-vectors, where $K$ is not a power of two. (For the definition, with $K$ replaced by
$N$, see §3.3.1.) Show that the cyclic convolution can be computed using FFTs on
$2^\lambda$ points for some suitable $\lambda$, or by using DFTs on $K$ points (see Exercise 2.9).
Which method is better?

Exercise 2.11 Devise a parallel version of Algorithm MultipleInversion as outlined
in §2.5.1. Analyse its time and space complexity. Try to minimise the number
of parallel processors required while achieving a parallel time complexity of
$O(\log k)$.

Exercise 2.12 Analyse the complexity of the algorithm outlined at the end of
§2.5.1 to compute $1/x \bmod N_1, \ldots, 1/x \bmod N_k$, when all the $N_i$ have size $n$, and
$x$ has size $\ell$. For which values of $n, \ell$ is it faster than the naive algorithm which
computes all modular inverses separately? [Assume $M(n)$ is quasi-linear, and
neglect multiplicative constants.]

Exercise 2.13 Write a RightToLeftBinaryExp algorithm and compare it with
Algorithm LeftToRightBinaryExp of §2.6.1.

Exercise 2.14 Investigate heuristic algorithms for obtaining close-to-optimal
addition (or multiplication) chains when the cost of a general addition $a + b$ (or
multiplication $a \cdot b$) is $\lambda$ times the cost of duplication $a + a$ (or squaring $a \cdot a$),
and $\lambda$ is some fixed positive constant. (This is a reasonable model for modular
exponentiation, because multiplication mod $N$ is generally more expensive than
squaring mod $N$. It is also a reasonable model for operations in groups defined
by elliptic curves, since in this case the formulae for addition and duplication are
usually different and have different costs.)

2.9 Notes and References 

Several number-theoretic algorithms make heavy use of modular arithmetic, in 
particular integer factorization algorithms (for example: Pollard's $\rho$ algorithm and
the elliptic curve method). 

Another important application of modular arithmetic in computer algebra is
computing the roots of a univariate polynomial over a finite field, which requires
efficient arithmetic over $F_p[x]$. See for example the excellent book "MCA" by von
zur Gathen and Gerhard [100].

We say in §2.1.3 that residue number systems can only be used when $N$ factors
into $N_1 N_2 \cdots$; this is not quite true, since Bernstein and Sorenson show in [24] how
to perform modular arithmetic using a residue number system.

For notes on the Kronecker-Schönhage trick, see §1.9.

Barrett's algorithm is described in [14], which also mentions the idea of using
two short products. The original description of Montgomery's REDC algorithm
is [170]. It is now widely used in several applications. However, only a few
authors considered using a reduction factor which is not of the form $\beta^n$, among
them McLaughlin [161] and Mihailescu [165]. The Montgomery-Svoboda algorithm
(§2.4.2) is also called "Montgomery tail tayloring" by Hars [113], who attributes
Svoboda's algorithm (more precisely its variant with the most significant word
being $\beta - 1$ instead of $\beta$) to Quisquater. The folding optimization of REDC
described in §2.4.2 (Subquadratic Montgomery Reduction) is an LSB-extension
of the algorithm described in the context of Barrett's algorithm by Hasenplaugh,
Gaubatz and Gopal [118]. Amongst the algorithms not covered in this book, we
mention the "bipartite modular multiplication" of Kaihara and Takagi [134], which
involves performing both MSB- and LSB-division in parallel.

The description of McLaughlin's algorithm in §2.4.3 follows [161, Variation 2];
McLaughlin's algorithm was reformulated in a polynomial context by
Mihailescu [165].

Many authors have proposed FFT algorithms, or improvements of such
algorithms, and applications such as fast computation of convolutions. Some references
are Aho, Hopcroft and Ullman [3]; Nussbaumer [177]; Borodin and Munro [35],
who describe the polynomial approach; Van Loan [223] for the linear algebra
approach; and Pollard [186] for the FFT over finite fields. Rader [188] considered the
case where the number of data points is a prime, and Winograd [231] generalised
Rader's algorithm to prime powers. Bluestein's algorithm [30] is also applicable
in these cases. In Bernstein [22, §23] the reader will find some historical remarks
and several nice applications of the FFT.

The Schönhage-Strassen algorithm first appeared in [200]. Recently Fürer [98]
has proposed an integer multiplication algorithm that is asymptotically faster than
the Schönhage-Strassen algorithm. Fürer's algorithm almost achieves the
conjectured best possible $O(n \log n)$ running time.

Concerning special moduli, Percival considers in [184] the case $N = a \pm b$
where both $a$ and $b$ are highly composite; this is a generalization of the case
$N = \beta^n \pm 1$. The pseudo-Mersenne primes of §2.4.4 are recommended by the
National Institute of Standards and Technology (NIST) [75]. See also the book by
Hankerson, Menezes and Vanstone [110].

Algorithm MultipleInversion (also known as "batch inversion") is due
to Montgomery [171]. The application of Barrett's algorithm for an implicitly
invariant divisor was suggested by Granlund.

Modular exponentiation and cryptographic algorithms are described in much
detail in the book by Menezes, van Oorschot and Vanstone [162, Chapter 14].
A detailed description of the best theoretical algorithms, with references, can be
found in Bernstein [18]. When both the modulus and base are invariant,
modular exponentiation with $k$-bit exponent and $n$-bit modulus can be performed
in time $O((k/\log k) M(n))$, after a precomputation of $O(k/\log k)$ powers in time
$O(k M(n))$. Take for example $b = 2^{k/t}$ in Note 14.112 and Algorithm 14.109
of [162], with $t \log t \approx k$, where the powers $a^{b^i} \bmod N$ for $0 \le i < t$ are
precomputed. An algorithm of same complexity using a DBNS (Double-Base Number
System) was proposed by Dimitrov, Jullien and Miller [86], however with a larger
table of $O(k^2)$ precomputed powers.

Original papers on Booth recoding, SRT division, etc., are reprinted in the
book by Swartzlander [213].

A quadratic algorithm for CRT reconstruction is discussed in [73]; Möller gives
some improvements in the case of a small number of small moduli known in
advance [168]. Algorithm IntegerToRNS can be found in Borodin and Moenck [34].
The explicit Chinese Remainder Theorem and its applications to modular
exponentiation are discussed by Bernstein and Sorenson in [24].



Chapter 3 

Floating-Point Arithmetic 



This chapter discusses the basic operations (addition, subtraction,
multiplication, division, square root, conversion) on arbitrary
precision floating-point numbers, as Chapter 1 does for
arbitrary precision integers. More advanced functions like
elementary and special functions are covered in Chapter 4. This
chapter largely follows the IEEE 754 standard, and extends it in
a natural way to arbitrary precision; deviations from IEEE 754
are explicitly mentioned. By default IEEE 754 refers to the 2008
revision, known as IEEE 754-2008; we write IEEE 754-1985 when
we explicitly refer to the 1985 initial standard. Topics not
discussed here include: hardware implementations, fixed-precision
implementations, special representations.

3.1 Representation 

The classical non-redundant representation of a floating-point number $x$ in
radix $\beta > 1$ is the following (other representations are discussed in §3.8):

$$x = (-1)^s \cdot m \cdot \beta^e, \qquad (3.1)$$

where $(-1)^s$, $s \in \{0, 1\}$, is the sign, $m \ge 0$ is the significand, and the integer
$e$ is the exponent of $x$. In addition, a positive integer $n$ defines the precision
of $x$, which means that the significand $m$ contains at most $n$ significant digits
in radix $\beta$.

An important special case is $m = 0$ representing zero. In this case the sign
$s$ and exponent $e$ are irrelevant and may be used to encode other information
(see for example §3.1.3).

For $m \ne 0$, several semantics are possible; the most common ones are:

• $\beta^{-1} \le m < 1$, then $\beta^{e-1} \le |x| < \beta^e$. In this case $m$ is an integer
multiple of $\beta^{-n}$. We say that the unit in the last place of $x$ is $\beta^{e-n}$, and
we write $\mathrm{ulp}(x) = \beta^{e-n}$. For example, $x = 3.1416$ with radix $\beta = 10$
is encoded by $m = 0.31416$ and $e = 1$. This is the convention that we
will use in this chapter;

• $1 \le m < \beta$, then $\beta^e \le |x| < \beta^{e+1}$, and $\mathrm{ulp}(x) = \beta^{e+1-n}$. With radix
ten the number $x = 3.1416$ is encoded by $m = 3.1416$ and $e = 0$. This
is the convention adopted in the IEEE 754 standard;

• we can also use an integer significand $\beta^{n-1} \le m < \beta^n$, then
$\beta^{e+n-1} \le |x| < \beta^{e+n}$, and $\mathrm{ulp}(x) = \beta^e$. With radix ten the number $x = 3.1416$ is
encoded by $m = 31416$ and $e = -4$.
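The three conventions can be compared on the running example with a small Python sketch. The function name `encodings` is an illustrative assumption; exact rationals from the standard `fractions` module avoid any binary rounding of 3.1416, and the input is assumed positive with at most $n$ digits in radix $\beta$.

```python
from fractions import Fraction


def encodings(x, beta, n):
    """Return (m, e) for x > 0 under the three significand conventions."""
    x = Fraction(x)
    base = Fraction(beta)
    # Find e such that beta^(e-1) <= x < beta^e.
    e = 0
    while x >= base ** e:
        e += 1
    while x < base ** (e - 1):
        e -= 1
    fractional = (x / base ** e, e)                 # 1/beta <= m < 1
    ieee_like = (x / base ** (e - 1), e - 1)        # 1 <= m < beta
    integer = (x / base ** (e - n), e - n)          # beta^(n-1) <= m < beta^n
    return fractional, ieee_like, integer


for m, e in encodings(Fraction("3.1416"), 10, 5):
    print(float(m), e)
```

All three pairs denote the same value $m \cdot \beta^e$; only the position of the radix point, and hence the exponent, differs.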

Note that in the above three cases, there is only one possible representation
of a non-zero floating-point number: we have a canonical representation.
In some applications, it is useful to relax the lower bound on nonzero $m$,
which in the three cases above gives respectively $0 < m < 1$, $0 < m < \beta$,
and $0 < m < \beta^n$, with $m$ an integer multiple of $\beta^{-n}$, $\beta^{1-n}$, and 1
respectively. In this case, there is no longer a canonical representation. For
example, with an integer significand and a precision of 5 digits, the number
3.1400 might be encoded by ($m = 31400$, $e = -4$), ($m = 03140$, $e = -3$), or
($m = 00314$, $e = -2$). This non-canonical representation has the drawback
that the most significant non-zero digit of the significand is not known in
advance. The unique encoding with a non-zero most significant digit, i.e.,
($m = 31400$, $e = -4$) here, is called the normalised (or simply normal)
encoding.

The significand is also sometimes called the mantissa or fraction. The
above examples demonstrate that the different significand semantics
correspond to different positions of the decimal (or radix $\beta$) point, or equivalently
to different biases of the exponent. We assume in this chapter that both the
radix $\beta$ and the significand semantics are implicit for a given implementation,
thus are not physically encoded.

The words "base" and "radix" have similar meanings. For clarity we
reserve "radix" for the constant $\beta$ in a floating-point representation such
as (3.1). The significand $m$ and exponent $e$ might be stored in a different
base, as discussed below.

3.1.1 Radix Choice 

Most floating-point implementations use radix $\beta = 2$ or a power of two,
because this is convenient and efficient on binary computers. For a radix $\beta$
which is not a power of 2, two choices are possible:

• store the significand in base $\beta$, or more generally in base $\beta^k$ for an
integer $k \ge 1$. Each digit in base $\beta^k$ requires $\lceil k \lg \beta \rceil$ bits. With such a
choice, individual digits can be accessed easily. With $\beta = 10$ and $k = 1$,
this is the "Binary Coded Decimal" or BCD encoding: each decimal
digit is represented by 4 bits, with a memory loss of about 17% (since
$\lg(10)/4 \approx 0.83$). A more compact choice is radix $10^3$, where 3 decimal
digits are stored in 10 bits, instead of in 12 bits with the BCD format.
This yields a memory loss of only 0.34% (since $\lg(1000)/10 \approx 0.9966$);

• store the significand in binary. This idea is used in Intel's Binary-
Integer Decimal (BID) encoding, and in one of the two decimal
encodings in IEEE 754-2008. Individual digits can not be accessed directly,
but one can use efficient binary hardware or software to perform
operations on the significand.
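The memory-loss figures quoted for the first choice can be recomputed with a few lines of Python. The function name `storage_efficiency` is an illustrative assumption; it uses floating-point `log2`, which is adequate for small radices such as 10 that are not powers of two.

```python
from math import ceil, log2


def storage_efficiency(beta, k):
    """Bits needed per base-beta^k digit, and the fraction of them carrying information."""
    information = k * log2(beta)      # k * lg(beta) bits of information
    bits = ceil(information)          # bits actually stored per digit
    return bits, information / bits


for k in (1, 3):
    bits, eff = storage_efficiency(10, k)
    print(f"base 10^{k}: {bits} bits per digit, loss {100 * (1 - eff):.2f}%")
```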

A drawback of the binary encoding is that, during the addition of two
arbitrary-precision numbers, it is not easy to detect if the significand
exceeds the maximum value $\beta^n - 1$ (when considered as an integer) and thus
if rounding is required. Either $\beta^n$ is precomputed, which is only realistic if
all computations involve the same precision $n$, or it is computed on the fly,
which might result in increased complexity (see Chapter 1 and §2.6.1).

3.1.2 Exponent Range 

In principle, one might consider an unbounded exponent. In other words, the
exponent $e$ might be encoded by an arbitrary-precision integer (see
Chapter 1). This would have the great advantage that no underflow or overflow
could occur (see below). However, in most applications, an exponent
encoded in 32 bits is more than enough: this enables us to represent values up
to about $10^{646456993}$ for $\beta = 2$. A result exceeding this value most
probably corresponds to an error in the algorithm or the implementation. Using
arbitrary-precision integers for the exponent induces an extra overhead that
slows down the implementation in the average case, and it usually requires
more memory to store each number.

Thus, in practice the exponent nearly always has a limited range $e_{\min} \le
e \le e_{\max}$. We say that a floating-point number is representable if it can be
represented in the form $(-1)^s \cdot m \cdot \beta^e$ with $e_{\min} \le e \le e_{\max}$. The set of
representable numbers clearly depends on the significand semantics. For the
convention we use here, i.e., $\beta^{-1} \le m < 1$, the smallest positive representable
floating-point number is $\beta^{e_{\min}-1}$, and the largest one is $\beta^{e_{\max}}(1 - \beta^{-n})$.

Other conventions for the significand yield different exponent ranges. For
example the double-precision format (called binary64 in IEEE 754-2008)
has $e_{\min} = -1022$, $e_{\max} = 1023$ for a significand in $[1, 2)$; this corresponds
to $e_{\min} = -1021$, $e_{\max} = 1024$ for a significand in $[1/2, 1)$, and $e_{\min} = -1074$,
$e_{\max} = 971$ for an integer significand in $[2^{52}, 2^{53})$.
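The exponent ranges quoted for binary64 can be checked from Python, assuming the interpreter's `float` is an IEEE 754 binary64 number (true on common platforms). `math.frexp` happens to use the $[1/2, 1)$ significand convention of this chapter. The last line also recomputes the decimal order of magnitude reachable with a 32-bit exponent and $\beta = 2$.

```python
import math
import sys

smallest_normal = sys.float_info.min   # 2**-1022
largest = sys.float_info.max           # (2**53 - 1) * 2**971

# Significand in [1/2, 1): frexp returns (m, e) with x = m * 2**e.
m_lo, e_lo = math.frexp(smallest_normal)
m_hi, e_hi = math.frexp(largest)
print(e_lo, e_hi)                      # exponent range for m in [1/2, 1)

# Integer significand in [2**52, 2**53).
assert smallest_normal == math.ldexp(2**52, -1074)
assert largest == math.ldexp(2**53 - 1, 971)

# Decimal exponent of 2**(2**31).
decimal_exponent = int(2**31 * math.log10(2))
print(decimal_exponent)
```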

3.1.3 Special Values 

With a bounded exponent range, if we want a complete arithmetic, we need
some special values to represent very large and very small values. Very small
values are naturally flushed to zero, which is a special number in the sense
that its significand is $m = 0$, which is not normalised. For very large values,
it is natural to introduce two special values $-\infty$ and $+\infty$, which encode large
non-representable values. Since we have two infinities, it is natural to have
two zeros $-0$ and $+0$, for example $1/(-\infty) = -0$ and $1/(+\infty) = +0$. This is
the IEEE 754 choice. Another possibility would be to have only one infinity
$\infty$ and one zero 0, forgetting the sign in both cases.

An additional special value is Not a Number (NaN), which either
represents an uninitialised value, or is the result of an invalid operation like $\sqrt{-1}$
or $(+\infty) - (+\infty)$. Some implementations distinguish between different kinds
of NaN, in particular IEEE 754 defines signalling and quiet NaNs.
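These special values can be observed with hardware doubles. The following Python sketch assumes IEEE 754 binary64 floats; the variable names are illustrative.

```python
import math

inf = float("inf")

neg_zero = 1.0 / -inf      # 1/(-inf) gives -0
pos_zero = 1.0 / inf       # 1/(+inf) gives +0
nan = inf - inf            # invalid operation gives NaN

# The two zeros compare equal; only the sign bit distinguishes them.
print(neg_zero == pos_zero, math.copysign(1.0, neg_zero), math.copysign(1.0, pos_zero))
# NaN is unordered: it is not even equal to itself.
print(math.isnan(nan), nan == nan)
```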

3.1.4 Subnormal Numbers 

Subnormal numbers are required by the IEEE 754 standard, to allow what is
called gradual underflow between the smallest (in absolute value) non-zero
normalised numbers and zero. We first explain what subnormal numbers are;
then we will see why they are not necessary in arbitrary precision.

Assume we have an integer significand in $[\beta^{n-1}, \beta^n)$ where $n$ is the
precision, and an exponent in $[e_{\min}, e_{\max}]$. Write $\eta = \beta^{e_{\min}}$. The two smallest
positive normalised numbers are $x = \beta^{n-1}\eta$ and $y = (\beta^{n-1} + 1)\eta$. The
difference $y - x$ equals $\eta$, which is tiny compared to $x$. In particular, $y - x$ can
not be represented exactly as a normalised number (assuming $\beta^{n-1} > 1$) and
will be rounded to zero in "rounding to nearest" mode (§3.1.9). This has the
unfortunate consequence that instructions like:

if (y != x) then 
z = 1.0/(y - x); 

will produce a "division by zero" error when executing 1.0/(y - x).

Subnormal numbers solve this problem. The idea is to relax the condition
$\beta^{n-1} \le m$ for the exponent $e_{\min}$. In other words, we include all numbers of
the form $m \cdot \beta^{e_{\min}}$ for $1 \le m < \beta^{n-1}$ in the set of valid floating-point numbers.
One could also permit $m = 0$, and then zero would be a subnormal number,
but we continue to regard zero as a special case.

Subnormal numbers are all positive integer multiples of $\pm\eta$, with a
multiplier $m$, $1 \le m < \beta^{n-1}$. The difference between $x = \beta^{n-1}\eta$ and
$y = (\beta^{n-1} + 1)\eta$ is now representable, since it equals $\eta$, the smallest positive
subnormal number. More generally, all floating-point numbers are multiples
of $\eta$, likewise for their sum or difference (in other words, operations in the
subnormal domain correspond to fixed-point arithmetic). If the sum or
difference is non-zero, it has magnitude at least $\eta$, thus can not be rounded to
zero. Thus the "division by zero" problem mentioned above does not occur
with subnormal numbers.
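The situation can be reproduced with hardware doubles, which do support subnormals. This Python sketch assumes IEEE 754 binary64 floats with gradual underflow enabled (the usual default); it builds the two smallest positive normal numbers and checks that their difference is the smallest subnormal rather than zero.

```python
import math

x = math.ldexp(1.0, -1022)               # smallest positive normal double
y = math.ldexp(1.0 + 2.0**-52, -1022)    # next double above x
eta = math.ldexp(1.0, -1074)             # smallest positive subnormal double

d = y - x
# With gradual underflow, y != x implies y - x != 0.
print(y != x, d != 0.0, d == eta)
```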

In the IEEE 754 double-precision format (called binary64 in IEEE
754-2008) the smallest positive normal number is $2^{-1022}$, and the smallest
positive subnormal number is $2^{-1074}$. In arbitrary precision, subnormal
numbers seldom occur, since usually the exponent range is huge compared to the
expected exponents in a given application. Thus the only reason for
implementing subnormal numbers in arbitrary precision is to provide an extension
of IEEE 754 arithmetic. Of course, if the exponent range is unbounded, then
there is absolutely no need for subnormal numbers, because any nonzero
floating-point number can be normalised.






Modern Computer Arithmetic, version 0.5.1 of April 28, 2010 



3.1.5 Encoding 

The encoding of a floating-point number $x = (-1)^s \cdot m \cdot \beta^e$ is the way the
values $s$, $m$ and $e$ are stored in the computer. Remember that $\beta$ is implicit,
i.e., is considered fixed for a given implementation; as a consequence, we do
not consider here mixed radix operations involving numbers with different
radices $\beta$ and $\beta'$.

We have already seen that there are several ways to encode the significand
$m$ when $\beta$ is not a power of two: in base $\beta^k$ or in binary. For normal numbers
in radix 2, i.e., $2^{n-1} \le m < 2^n$, the leading bit of the significand is necessarily
1, thus one might choose not to encode it in memory, to gain an extra bit of
precision. This is called the implicit leading bit, and it is the choice made in
the IEEE 754 formats. For example the double-precision format has a sign
bit, an exponent field of 11 bits, and a significand of 53 bits, with only 52
bits stored, which gives a total of 64 stored bits:



    sign       (biased) exponent    significand
    (1 bit)    (11 bits)            (52 bits, plus implicit leading bit)



A nice consequence of this particular encoding is the following. Let $x$ be
a double-precision number, neither subnormal, $\pm\infty$, NaN, nor the largest
normal number in absolute value. Consider the 64-bit encoding of $x$ as a
64-bit integer, with the sign bit in the most significant bit, the exponent bits in
the next most significant bits, and the explicit part of the significand in the
low significant bits. Adding 1 to this 64-bit integer yields the next
double-precision number to $x$, away from zero. Indeed, if the significand $m$ is smaller
than $2^{53} - 1$, $m$ becomes $m + 1$ which is smaller than $2^{53}$. If $m = 2^{53} - 1$,
then the lowest 52 bits are all set, and a carry occurs between the significand
field and the exponent field. Since the significand field becomes zero, the
new significand is $2^{52}$, taking into account the implicit leading bit. This
corresponds to a change from $(2^{53} - 1) \cdot 2^e$ to $2^{52} \cdot 2^{e+1}$, which is exactly the
next number away from zero. Thanks to this consequence of the encoding, an
integer comparison of two words (ignoring the actual type of the operands)
should give the same result as a floating-point comparison, so it is possible
to sort normal positive floating-point numbers as if they were integers of the
same length (64-bit for double precision).
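This property is easy to observe by reinterpreting the 64 bits of a double as an integer. The Python sketch below uses the standard `struct` module; the helper names `to_bits`, `from_bits` and `next_away_from_zero` are illustrative, and the code assumes that `float` is IEEE 754 binary64.

```python
import struct


def to_bits(x):
    """Return the 64-bit encoding of the double x as an unsigned integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def from_bits(b):
    """Return the double whose 64-bit encoding is the unsigned integer b."""
    return struct.unpack("<d", struct.pack("<Q", b))[0]


def next_away_from_zero(x):
    """Successor of x away from zero (x normal, finite, not the largest)."""
    return from_bits(to_bits(x) + 1)


print(next_away_from_zero(1.0) == 1.0 + 2.0**-52)     # significand m + 1
print(next_away_from_zero(2.0 - 2.0**-52) == 2.0)     # carry into the exponent
values = [3.0, 0.5, 2.0**70, 1e-3]
print(sorted(values) == sorted(values, key=to_bits))  # integer order = float order
```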

In arbitrary precision, saving one bit is not as crucial as in fixed (small)
precision, where one is constrained by the word size (usually 32 or 64 bits).
Thus, in arbitrary precision, it is easier and preferable to encode the whole
significand. Also, note that having an "implicit bit" is not possible in radix
$\beta > 2$, since for a normal number the most significant digit might take several
values, from 1 to $\beta - 1$.

When the significand occupies several words, it can be stored in a linked 
list, or in an array (with a separate size field). Lists are easier to extend, but 
accessing arrays is usually more efficient because fewer memory references 
are required in the inner loops and memory locality is better. 

The sign $s$ is most easily encoded as a separate bit field, with a
non-negative significand. This is the sign-magnitude encoding. Other possibilities
are to have a signed significand, using either 1's complement or 2's
complement, but in the latter case a special encoding is required for zero, if it is
desired to distinguish $+0$ from $-0$. Finally, the exponent might be encoded
as a signed word (for example, type long in the C language).

3.1.6 Precision: Local, Global, Operation, Operand 

The different operands of a given operation might have different precisions, 
and the result of that operation might be desired with yet another precision. 
There are several ways to address this issue. 

• The precision, say $n$, is attached to a given operation. In this case,
operands with a smaller precision are automatically converted to
precision $n$. Operands with a larger precision might either be left unchanged,
or rounded to precision $n$. In the former case, the code implementing
the operation must be able to handle operands with different precisions.
In the latter case, the rounding mode to shorten the operands must be
specified. Note that this rounding mode might differ from that of the
operation itself, and that operand rounding might yield large errors.
Consider for example $a = 1.345$ and $b = 1.234567$ with a precision of 4
digits. If $b$ is taken as exact, the exact value of $a - b$ equals 0.110433,
which when rounded to nearest becomes 0.1104. If $b$ is first rounded to
nearest to 4 digits, we get $b' = 1.235$, and $a - b' = 0.1100$ is rounded
to itself.

• The precision $n$ is attached to each variable. Here again two cases may
occur. If the operation destination is part of the operation inputs, as
in sub(c, a, b), which means $c \leftarrow \mathrm{round}(a - b)$, then the precision of
the result operand $c$ is known, thus the rounding precision is known in
advance. Alternatively, if no precision is given for the result, one might
choose the maximal (or minimal) precision from the input operands, or
use a global variable, or request an extra precision parameter for the
operation, as in c = sub(a, b, n).
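The numerical example with $a = 1.345$ and $b = 1.234567$ can be replayed with Python's standard `decimal` module, which attaches the precision to the operation (through a context) and treats operands as exact. This is only an illustration of the two semantics, not of any particular arbitrary-precision library; the variable names are illustrative.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

ctx = Context(prec=4, rounding=ROUND_HALF_EVEN)   # 4-digit operations
a = Decimal("1.345")
b = Decimal("1.234567")

# Operands taken as exact: only the result a - b = 0.110433 is rounded.
exact_operands = ctx.subtract(a, b)

# Operand b first rounded to 4 digits, then subtracted.
b_rounded = ctx.plus(b)
rounded_operands = ctx.subtract(a, b_rounded)

print(exact_operands, b_rounded, rounded_operands)
```

The two results differ in the fourth significant digit, which is the "large error" caused by pre-rounding an operand before a subtraction with cancellation.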

Of course, these different semantics are inequivalent, and may yield different 
results. In the following, we consider the case where each variable, including 
the destination variable, has its own precision, and no pre-rounding or post- 
rounding occurs. In other words, the operands are considered exact to their 
full precision. 

Rounding is considered in detail in §3.1.9. Here we define what we mean
by the correct rounding of a function.

Definition 3.1.1 Let $a, b, \ldots$ be floating-point numbers, $f$ a mathematical
function, $n \ge 1$ an integer, and $\circ$ a rounding mode. We say that $c$ is the
correct rounding of $f(a, b, \ldots)$, and we write $c = \circ_n(f(a, b, \ldots))$, if $c$ is the
floating-point number closest to $f(a, b, \ldots)$ in precision $n$ and according to
the given rounding mode. In case several numbers are at the same distance
from $f(a, b, \ldots)$, the rounding mode must define in a deterministic way which
one is "the closest". When there is no ambiguity, we omit $n$ and write simply
$c = \circ(f(a, b, \ldots))$.

3.1.7 Link to Integers 

Most floating-point operations reduce to arithmetic on the significands, which
can be considered as integers as seen at the beginning of this section.
Therefore efficient arbitrary precision floating-point arithmetic requires efficient
underlying integer arithmetic (see Chapter 1).

Conversely, floating-point numbers might be useful for the implementation
of arbitrary precision integer arithmetic. For example, one might use
hardware floating-point numbers to represent an arbitrary precision integer.
Indeed, since a double-precision floating-point number has 53 bits of
precision, it can represent an integer up to $2^{53} - 1$, and an integer $A$ can be
represented as: $A = a_{n-1}\beta^{n-1} + \cdots + a_i\beta^i + \cdots + a_1\beta + a_0$, where $\beta = 2^{53}$,
and the $a_i$ are stored in double-precision data types. Such an encoding was
popular when most processors were 32-bit, and some had relatively slow
integer operations in hardware. Now that most computers are 64-bit, this
encoding is obsolete.

Floating-point expansions are a variant of the above. Instead of storing
$a_i$ and having $\beta^i$ implicit, the idea is to directly store $a_i\beta^i$. Of course, this
only works for relatively small $i$, i.e., whenever $a_i\beta^i$ does not exceed the
format range. For example, for IEEE 754 double precision, the maximal
integer precision is 1024 bits. (Alternatively, one might represent an integer
as a multiple of the smallest positive number $2^{-1074}$, with a corresponding
maximal precision of 2098 bits.)

Hardware floating-point numbers might also be used to implement the
Fast Fourier Transform (FFT), using complex numbers with floating-point
real and imaginary part (see §3.3.1).

3.1.8 Ziv's Algorithm and Error Analysis 

A rounding boundary is a point at which the rounding function $\circ(x)$ is
discontinuous.

In fixed precision, for basic arithmetic operations, it is sometimes possi- 
ble to design one-pass algorithms that directly compute a correct rounding. 
However, in arbitrary precision, or for elementary or special functions, the 
classical method is to use Ziv's algorithm: 

1. we are given an input x, a target precision n, and a rounding mode; 

2. compute an approximation $y$ with precision $m > n$, and a corresponding
error bound $\varepsilon$ such that $|y - f(x)| \le \varepsilon$;

3. if $[y - \varepsilon, y + \varepsilon]$ contains a rounding boundary, increase $m$ and go to
step 2;

4. output the rounding of y, according to the given rounding mode. 

The error bound $\varepsilon$ at step 2 might be computed either a priori, i.e., from
$x$ and $n$ only, or dynamically, i.e., from the different intermediate values
computed by the algorithm. A dynamic bound will usually be tighter, but
will require extra computations (however, those computations might be done
in low precision).
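The loop structure of Ziv's algorithm can be sketched in Python for $f = \exp$ on rational inputs $0 < x \le 1$, rounding to nearest with $n$ significant bits. Everything here is an illustrative assumption rather than the book's method: the function names, the use of exact rationals with a truncated Taylor series, the a priori bound $\varepsilon = 2^{-m}$, and the strategy of doubling $m$. The test "both ends of $[y - \varepsilon, y + \varepsilon]$ round to the same number" is used to detect the absence of a rounding boundary in the interval.

```python
from fractions import Fraction


def round_to_nearest(q, n):
    """Round the positive rational q to n significant bits, ties to even."""
    e = 0
    while q >= Fraction(2) ** e:
        e += 1
    while q < Fraction(2) ** (e - 1):
        e -= 1
    ulp = Fraction(2) ** (e - n)          # here 2^(e-1) <= q < 2^e
    k, rem = divmod(q, ulp)
    if 2 * rem > ulp or (2 * rem == ulp and k % 2 == 1):
        k += 1
    return k * ulp


def exp_approx(x, m):
    """Return (y, eps) with |y - exp(x)| <= eps = 2^-m, assuming 0 < x <= 1."""
    eps = Fraction(1, 2 ** m)
    y, term, k = Fraction(0), Fraction(1), 0
    # For 0 < x <= 1 the neglected tail is at most twice the first omitted term.
    while 2 * term > eps:
        y += term
        k += 1
        term = term * x / k
    return y, eps


def exp_correctly_rounded(x, n):
    """Ziv-style loop: raise the working precision until rounding is safe."""
    m = n + 8
    while True:
        y, eps = exp_approx(x, m)                      # step 2
        lo = round_to_nearest(y - eps, n)
        hi = round_to_nearest(y + eps, n)
        if lo == hi:                                   # no boundary in the interval
            return lo                                  # step 4
        m *= 2                                         # step 3


print(exp_correctly_rounded(Fraction(1), 4))   # 11/4
```

Termination relies on $\exp(x)$ never being exactly a rounding boundary for rational $x \ne 0$; for functions with exactly representable values, a real implementation needs to detect those cases separately.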

Depending on the mathematical function to be implemented, one might
prefer an absolute or a relative error analysis. When computing a relative
error bound, at least two techniques are available: one might express the
errors in terms of units in the last place (ulps), or one might express them
in terms of true relative error. It is of course possible in a given analysis to
mix both kinds of errors, but in general one loses a constant factor (the
radix $\beta$) when converting from one kind of relative error to the other kind.

Another important distinction is forward versus backward error analysis.
Assume we want to compute $y = f(x)$. Because the input is rounded, and/or
because of rounding errors during the computation, we might actually
compute $y' \approx f(x')$. Forward error analysis will bound $|y' - y|$ if we have a bound
on $|x' - x|$ and on the rounding errors that occur during the computation.

Backward error analysis works in the other direction. If the computed
value is $y'$, then backward error analysis will give us a number $\delta$ such that,
for some $x'$ in the ball $|x' - x| \le \delta$, we have $y' = f(x')$. This means that the
error is no worse than might have been caused by an error of $\delta$ in the input
value. Note that, if the problem is ill-conditioned, $\delta$ might be small even if
$|y' - y|$ is large.

In our error analyses, we assume that no overflow or underflow occurs, 
or equivalently that the exponent range is unbounded, unless the contrary is 
explicitly stated. 



3.1.9 Rounding 

There are several possible definitions of rounding. For example probabilistic
rounding (also called stochastic rounding) chooses at random a rounding
towards $+\infty$ or $-\infty$ for each operation. The IEEE 754 standard defines four
rounding modes: towards zero, $+\infty$, $-\infty$ and to nearest (with ties broken to
even). Another useful mode is "rounding away from zero", which rounds in
the opposite direction from zero: a positive number is rounded towards $+\infty$,
and a negative number towards $-\infty$. If the sign of the result is known, all
IEEE 754 rounding modes might be converted to either rounding to nearest,
rounding towards zero, or rounding away from zero.
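
As an illustration, the five modes can be tried out with Python's decimal module, whose rounding constants ROUND_DOWN, ROUND_UP, ROUND_CEILING, ROUND_FLOOR and ROUND_HALF_EVEN correspond to the modes named above. The helper round_to_integer and the mode labels are our own naming, not part of any standard.

```python
from decimal import (Decimal, ROUND_DOWN, ROUND_UP, ROUND_CEILING,
                     ROUND_FLOOR, ROUND_HALF_EVEN)

MODES = {
    "towards zero": ROUND_DOWN,
    "away from zero": ROUND_UP,
    "towards +inf": ROUND_CEILING,
    "towards -inf": ROUND_FLOOR,
    "to nearest (even)": ROUND_HALF_EVEN,
}

def round_to_integer(value, mode):
    """Round a decimal string to an integer using the named rounding mode."""
    return int(Decimal(value).quantize(Decimal(1), rounding=MODES[mode]))

for mode in MODES:
    print(f"{mode:18s}", round_to_integer("2.5", mode), round_to_integer("-2.5", mode))
```

For a negative input, rounding towards $+\infty$ gives the same result as rounding towards zero, and rounding towards $-\infty$ the same as rounding away from zero, which is the conversion mentioned in the last sentence above.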



Theorem 3.1.1 Consider a floating-point system with radix $\beta$ and
precision n. Let u be the rounding to nearest of some real x. Then the
following inequalities hold:

$$|u - x| \le \tfrac{1}{2}\,\mathrm{ulp}(u),$$
$$|u - x| \le \tfrac{1}{2}\,\beta^{1-n}\,|u|,$$
$$|u - x| \le \tfrac{1}{2}\,\beta^{1-n}\,|x|.$$




Proof. For $x = 0$, necessarily $u = 0$, and the statement holds. Without loss
of generality, we can assume u and x positive. The first inequality is the
definition of rounding to nearest, and the second one follows from
$\mathrm{ulp}(u) \le \beta^{1-n}u$. (In the case $\beta = 2$, it gives $|u - x| \le 2^{-n}|u|$.) For the last inequality,
we distinguish two cases: if $u \le x$, it follows from the second inequality. If
$x < u$, then if x and u have the same exponent, i.e., $\beta^{e-1} \le x < u < \beta^e$,
then $\mathrm{ulp}(u) = \beta^{e-n} \le \beta^{1-n}x$. The only remaining case is $\beta^{e-1} \le x < u = \beta^e$.
Since the floating-point number preceding $\beta^e$ is $\beta^e(1 - \beta^{-n})$, and x was
rounded to nearest, we have $|u - x| \le \beta^{e-n}/2$ here too. □
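
The three inequalities can be checked numerically for binary64 ($\beta = 2$, $n = 53$). The sketch below assumes that float(x) returns the double nearest to the rational x (which CPython's integer division provides), that x lies in the normal range, and Python 3.9 or later for math.ulp; the function name check_theorem is ours.

```python
from fractions import Fraction
import math

def check_theorem(x, n=53):
    """Return the truth values of the three bounds of Theorem 3.1.1 for a Fraction x."""
    u = float(x)                               # assumed: rounding to nearest
    err = abs(Fraction(u) - x)                 # exact rounding error
    half = Fraction(1, 2)
    rel = half * Fraction(2) ** (1 - n)        # (1/2) * beta^(1-n) with beta = 2
    return (err <= half * Fraction(math.ulp(u)),
            err <= rel * abs(Fraction(u)),
            err <= rel * abs(x))

for x in (Fraction(1, 3), Fraction(1) - Fraction(1, 2 ** 60), Fraction(10 ** 30 + 1, 7)):
    print(check_theorem(x))
```

The second test value lies just below a power of two and rounds up to it, which is the "remaining case" treated at the end of the proof.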

In order to round according to a given rounding mode, one proceeds as 
follows: 

1. first round as if the exponent range was unbounded, with the given 
rounding mode; 

2. if the rounded result is within the exponent range, return this result; 

3. otherwise raise the "underflow" or "overflow" exception, and return $\pm 0$
or $\pm\infty$ accordingly.

For example, assume radix 10 with precision 4, $e_{\max} = 3$, with $x = 0.9234 \cdot 10^3$,
$y = 0.7656 \cdot 10^2$. The exact sum $x + y$ equals $0.99996 \cdot 10^3$. With rounding
towards zero, we obtain $0.9999 \cdot 10^3$, which is representable, so there is no
overflow. With rounding to nearest, $x + y$ rounds to $0.1000 \cdot 10^4$, where the
exponent 4 exceeds $e_{\max} = 3$, so we get $+\infty$ as the result, with an overflow.
In this model, overflow depends not only on the operands, but also on the
rounding mode.
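
This example can be reproduced with Python's decimal module, which follows the same "round first, then check the exponent" model. Note one difference of convention: decimal normalises significands to [1, 10), so $e_{\max} = 3$ above corresponds to Emax=2 in the sketch. The function add_with_mode is a hypothetical helper.

```python
from decimal import Context, Decimal, Overflow, ROUND_DOWN, ROUND_HALF_EVEN

def add_with_mode(rounding):
    """Add 923.4 and 76.56 with 4 digits; return (result, overflow flag)."""
    # Emax=2 makes 9.999E+2 (= 0.9999e3) the largest finite number
    ctx = Context(prec=4, Emax=2, Emin=-2, rounding=rounding, traps=[])
    result = ctx.add(Decimal("923.4"), Decimal("76.56"))
    return result, bool(ctx.flags[Overflow])

print(add_with_mode(ROUND_DOWN))
print(add_with_mode(ROUND_HALF_EVEN))
```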

The "round to nearest" mode of IEEE 754 rounds the result of an operation
to the nearest representable number. In case the result of an operation
is exactly halfway between two consecutive numbers, the one with least
significant bit zero is chosen (for radix 2). For example $1.1011_2$ is rounded with
a precision of 4 bits to $1.110_2$, as is $1.1101_2$. However this rule does not
readily extend to an arbitrary radix. Consider for example radix $\beta = 3$, a
precision of 4 digits, and the number $1212.111\ldots_3$. Both $1212_3$ and $1220_3$
end in an even digit. The natural extension is to require the whole significand
to be even, when interpreted as an integer in $[\beta^{n-1}, \beta^n - 1]$. In this setting,
$(1212.111\ldots)_3$ rounds to $(1212)_3 = 50_{10}$. (Note that $\beta^n$ is an odd number
here.)

Assume we want to correctly round a real number, whose binary expansion
is $2^e \cdot 0.1b_2 \ldots b_n b_{n+1} \ldots$, to n bits. It is enough to know the values of
$r = b_{n+1}$ (called the round bit) and that of the sticky bit s, which is 0
when $b_{n+2}b_{n+3}\ldots$ is identically zero, and 1 otherwise. Table 3.1 shows how
to correctly round given r, s, and the given rounding mode; rounding to $\pm\infty$
being converted to rounding towards zero or away from zero, according to
the sign of the number. The entry "$b_n$" is for round to nearest in the case of
a tie: if $b_n = 0$ it will be unchanged, but if $b_n = 1$ we add 1 (thus changing
$b_n$ to 0).

 r   s    towards zero    to nearest    away from zero
 0   0         0              0               0
 0   1         0              0               1
 1   0         0            $b_n$             1
 1   1         0              1               1

Table 3.1: Rounding rules according to the round bit r and the sticky bit s:
a "0" entry means truncate (round towards zero), a "1" means round away
from zero (add 1 to the truncated significand).
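
Table 3.1 translates directly into code. The sketch below rounds a nonnegative integer significand by discarding its low bits; the function name round_bits and the mode labels are hypothetical, and at least one bit is assumed to be discarded.

```python
def round_bits(value, drop, mode, negative=False):
    """Round the nonnegative integer `value` by discarding its `drop` >= 1 low bits.

    mode is "zero", "nearest" or "away"; "+inf" and "-inf" are first mapped
    to "zero" or "away" according to the sign, as described in the text.
    """
    if mode in ("+inf", "-inf"):
        mode = "away" if (mode == "-inf") == negative else "zero"
    kept = value >> drop                              # truncated significand
    r = (value >> (drop - 1)) & 1                     # round bit
    s = int(value & ((1 << (drop - 1)) - 1) != 0)     # sticky bit
    if mode == "zero":
        inc = 0
    elif mode == "away":
        inc = r | s
    else:                                             # to nearest, ties to even
        inc = r & (s | (kept & 1))                    # on a tie, add the last kept bit b_n
    return kept + inc

print(bin(round_bits(0b11011, 1, "nearest")), bin(round_bits(0b11101, 1, "nearest")))
```

The two printed values correspond to the example of $1.1011_2$ and $1.1101_2$ rounded to 4 bits.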

In general, we do not have an infinite expansion, but a finite approxima- 
tion y of an unknown real value x. For example, y might be the result of an 
arithmetic operation such as division, or an approximation to the value of a 
transcendental function such as exp. The following problem arises: given the 
approximation y, and a bound on the error $|y - x|$, is it possible to determine
the correct rounding of x? Algorithm RoundingPossible returns true if
and only if it is possible.

Algorithm 3.1 RoundingPossible

Input: a floating-point number $y = 0.1y_2 \ldots y_m$, a precision $n \le m$, an error
bound $\varepsilon = 2^{-k}$, a rounding mode $\circ$
Output: true when $\circ_n(x)$ can be determined for $|y - x| \le \varepsilon$

if $k \le n + 1$ then return false
if $\circ$ is to nearest then $r \leftarrow 1$ else $r \leftarrow 0$
if $y_{n+1} = r$ and $y_{n+2} = \cdots = y_m = 0$ then $s \leftarrow 0$ else $s \leftarrow 1$
if $s = 1$ then return true else return false.

Proof. Since rounding is monotonic, it is possible to determine $\circ(x)$ exactly
when $\circ(y - 2^{-k}) = \circ(y + 2^{-k})$, or in other words when the interval $[y - 2^{-k},
y + 2^{-k}]$ contains no rounding boundary (or only one as $y - 2^{-k}$ or $y + 2^{-k}$).

If $k \le n + 1$, then the interval $[-2^{-k}, 2^{-k}]$ has width at least $2^{-n}$, thus
contains at least one rounding boundary in its interior, or two rounding
boundaries, and it is not possible to round correctly. In the case of directed
rounding (resp. rounding to nearest), if $s = 0$ the approximation y is
representable (resp. the middle of two representable numbers) in precision n, and it
is clearly not possible to round correctly; if $s = 1$ the interval $[y - 2^{-k}, y + 2^{-k}]$
contains at most one rounding boundary, and if so it is one of the bounds,
thus it is possible to round correctly. □
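
The monotonicity criterion at the start of the proof can also be tested directly with exact rational arithmetic. The sketch below is not the bit test of Algorithm RoundingPossible: it simply compares the roundings of the two endpoints, which is slightly conservative when a boundary coincides with an endpoint. The names rnd and rounding_possible are hypothetical, y is assumed to lie in [1/2, 1), and precision n is modelled as multiples of $2^{-n}$.

```python
from fractions import Fraction
import math

def rnd(v, n, mode):
    """Round the Fraction v >= 0 to a multiple of 2**-n."""
    scaled = v * (1 << n)
    if mode == "nearest":
        q = round(scaled)                 # ties to even
    elif mode == "zero":
        q = math.floor(scaled)
    else:                                 # "away"
        q = math.ceil(scaled)
    return Fraction(q, 1 << n)

def rounding_possible(y, k, n, mode):
    """True when every x with |y - x| <= 2**-k rounds to the same value."""
    eps = Fraction(1, 1 << k)
    return rnd(y - eps, n, mode) == rnd(y + eps, n, mode)

print(rounding_possible(Fraction(19, 32), 7, 4, "nearest"))   # y = 0.10011 is a midpoint
print(rounding_possible(Fraction(37, 64), 7, 4, "nearest"))   # y = 0.100101
```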

The Double Rounding Problem 

When a given real value x is first rounded to precision m, then to precision
$n < m$, we say that a "double rounding" occurs. The "double rounding
problem" happens when this latter value differs from the direct rounding of
x to the smaller precision n, assuming the same rounding mode is used in all
cases, i.e., when:

$$\circ_n(\circ_m(x)) \ne \circ_n(x).$$

The double rounding problem does not occur for directed rounding modes.
For these rounding modes, the rounding boundaries at the larger precision
m refine those at the smaller precision n, thus all real values x that round to
the same value y at precision m also round to the same value at precision n,
namely $\circ_n(y)$.

Consider the decimal value x = 3.14251. Rounding to nearest to 5 digits, 
we get y = 3.1425; rounding y to nearest-even to 4 digits, we get 3.142, 
whereas direct rounding of x would give 3.143. 

With rounding to nearest mode, the double rounding problem only occurs
when the second rounding involves the even-rule, i.e., the value $y = \circ_m(x)$
is a rounding boundary at precision n. Otherwise y has distance at least
one ulp (in precision m) from a rounding boundary at precision n, and since
$|y - x|$ is bounded by half an ulp (in precision m), all possible values for x
round to the same value in precision n.

Note that the double rounding problem does not occur with all ways of
breaking ties for rounding to nearest (Exercise 3.2).
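
The decimal example above is easy to replay with Python's decimal module. The helper round_digits is our own; it relies on Context.plus, which rounds its argument to the precision of the context.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN, ROUND_DOWN

def round_digits(value, digits, rounding=ROUND_HALF_EVEN):
    """Round a Decimal to `digits` significant digits."""
    return Context(prec=digits, rounding=rounding).plus(value)

x = Decimal("3.14251")
twice = round_digits(round_digits(x, 5), 4)     # 5 digits first, then 4
once = round_digits(x, 4)                       # directly to 4 digits
print(twice, once)

# With a directed mode the two paths agree.
print(round_digits(round_digits(x, 5, ROUND_DOWN), 4, ROUND_DOWN),
      round_digits(x, 4, ROUND_DOWN))
```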



3.1.10 Strategies 

To determine the correct rounding of $f(x)$ with n bits of precision, the best
strategy is usually to first compute an approximation y to $f(x)$ with a working
precision of $m = n + h$ bits, with h relatively small. Several strategies are
possible in Ziv's algorithm (§3.1.8) when this first approximation y is not
accurate enough, or too close to a rounding boundary:

• compute the exact value of $f(x)$, and round it to the target precision n.
This is possible for a basic operation, for example $f(x) = x^2$, or more
generally $f(x, y) = x + y$ or $x \times y$. Some elementary functions may
yield an exactly representable output too, for example $\sqrt{2.25} = 1.5$.
An "exact result" test after the first approximation avoids possibly
unnecessary further computations;

• repeat the computation with a larger working precision $m' = n + h'$.
Assuming that the digits of $f(x)$ behave "randomly" and that $|f'(x)/f(x)|$
is not too large, using $h' \approx \lg n$ is enough to guarantee that rounding is
possible with probability $1 - O(1/n)$. If rounding is still not possible,
because the $h'$ last digits of the approximation encode 0 or $2^{h'} - 1$, one
can increase the working precision and try again. A check for exact
results guarantees that this process will eventually terminate, provided
the algorithm used has the property that it gives the exact result if this
result is representable and the working precision is high enough. For
example, the square root algorithm should return the exact result if it
is representable (see Algorithm FPSqrt in §3.5, and also Exercise 3.3).

3.2 Addition, Subtraction, Comparison

Addition and subtraction of floating-point numbers operate from the most
significant digits, whereas integer addition and subtraction start from the
least significant digits. Thus completely different algorithms are involved.
Also, in the floating-point case, part or all of the inputs might have no 
impact on the output, except in the rounding phase. 

In summary, floating-point addition and subtraction are more difficult to 
implement than integer addition/subtraction for two reasons: 

• scaling due to the exponents requires shifting the significands before
adding or subtracting them. In principle one could perform all operations
using only integer operations, but this might require huge integers,
for example when adding 1 and $2^{-1000}$;

• as the carries are propagated from least to most significant digits, one 
may have to look at arbitrarily low input digits to guarantee correct 
rounding. 
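
To make the first point concrete, here is a reference addition that works exactly on rationals and rounds once at the end. It is a naive baseline, not Algorithm FPadd; the names round_to_precision and exact_add are hypothetical, and only rounding to nearest (ties to even) of positive values is handled.

```python
from fractions import Fraction

def round_to_precision(v, n):
    """Round a positive Fraction to n significant bits (ties to even)."""
    e = v.numerator.bit_length() - v.denominator.bit_length()
    if v >= Fraction(2) ** e:              # ensure 2**(e-1) <= v < 2**e
        e += 1
    ulp = Fraction(2) ** (e - n)
    return round(v / ulp) * ulp

def exact_add(b, c, n):
    """Reference addition: exact rational sum, then a single rounding."""
    return round_to_precision(Fraction(b) + Fraction(c), n)

tiny = Fraction(1, 2 ** 1000)
print(exact_add(1, tiny, 53) == 1)
print((Fraction(1) + tiny).numerator.bit_length())   # size of the exact intermediate
```

The exact sum of 1 and $2^{-1000}$ needs an integer of about a thousand bits although only 53 bits are kept, which is why practical algorithms avoid forming it.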

In this section, we distinguish between "addition", where both operands
to be added have the same sign, and "subtraction", where the operands to 
be added have different signs (we assume a sign-magnitude representation). 
The case of one or both operands zero is treated separately; in the description 
below we assume that all operands are nonzero. 

3.2.1 Floating-Point Addition 

Algorithm FPadd adds two binary floating-point numbers b and c of the
same sign. More precisely, it computes the correct rounding of $b + c$, with
respect to the given rounding mode $\circ$. For the sake of simplicity, we assume
b and c are positive, $b \ge c > 0$. It will also be convenient to scale b and c so
that $2^{n-1} \le b < 2^n$ and $2^{m-1} \le c < 2^m$, where n is the desired precision of
the output, and $m \le n$. Of course, if the inputs b and c to Algorithm FPadd
are scaled by $2^k$, then to compensate for this the output must be scaled by
$2^{-k}$. We assume that the rounding mode is to nearest, towards zero, or away
from zero (rounding to $\pm\infty$ reduces to rounding towards zero or away from
zero, depending on the sign of the operands).

The values of round($\circ$, r, s) and round2($\circ$, a, t) are given in Table 3.2. We
have simplified some of the expressions given in Table 3.2. For example, in
the upper half of the table, $r \lor s$ means 0 if $r = s = 0$, and 1 otherwise.
In the lower half of the table, $2\lfloor (a+1)/4 \rfloor$ is $(a-1)/2$ if $a = 1 \bmod 4$, and
$(a+1)/2$ if $a = 3 \bmod 4$.

Algorithm 3.2 FPadd

Input: $b \ge c > 0$ two binary floating-point numbers, a precision n such that
$2^{n-1} \le b < 2^n$, and a rounding mode $\circ$
Output: a floating-point number a of precision n and scale e such that
$a \cdot 2^e = \circ(b + c)$

1: split b into $b_h + b_\ell$ where $b_h$ contains the n most significant bits of b.
2: split c into $c_h + c_\ell$ where $c_h$ contains the most significant bits of c, and
   $\mathrm{ulp}(c_h) = \mathrm{ulp}(b_h) = 1$    ▷ $c_h$ might be zero
3: $a_h \leftarrow b_h + c_h$, $e \leftarrow 0$
4: $(c, r, s) \leftarrow b_\ell + c_\ell$    ▷ see the text
5: $(a, t) \leftarrow (a_h + c + \mathrm{round}(\circ, r, s), \text{etc.})$    ▷ for t see Table 3.2 (upper)
6: if $a \ge 2^n$ then
7:    $(a, e) \leftarrow (\mathrm{round2}(\circ, a, t), e + 1)$    ▷ see Table 3.2 (lower)
8: if $a = 2^n$ then $(a, e) \leftarrow (a/2, e + 1)$
9: return (a, e).


At step 4 of Algorithm FPadd, the notation $(c, r, s) \leftarrow b_\ell + c_\ell$ means
that c is the carry bit of $b_\ell + c_\ell$, r the round bit, and s the sticky bit;
$c, r, s \in \{0, 1\}$. For rounding to nearest, $t = \mathrm{sign}(b + c - a)$ is a ternary value
which is respectively positive, zero, or negative when a is smaller than, equal
to, or larger than the exact sum $b + c$.

Theorem 3.2.1 Algorithm FPadd is correct. 

Proof. We have $2^{n-1} \le b < 2^n$ and $2^{m-1} \le c < 2^m$, with $m \le n$. Thus $b_h$
and $c_h$ are the integer parts of b and c, $b_\ell$ and $c_\ell$ their fractional parts. Since
$b \ge c$, we have $c_h \le b_h$ and $2^{n-1} \le b_h \le 2^n - 1$, thus $2^{n-1} \le a_h \le 2^{n+1} - 2$,
and at step 5, $2^{n-1} \le a \le 2^{n+1}$. If $a < 2^n$, a is the correct rounding of
$b + c$. Otherwise, we face the "double rounding" problem: rounding a down
to n bits will give the correct result, except when a is odd and rounding is to
nearest. In that case, we need to know if the first rounding was exact, and
if not in which direction it was rounded; this information is encoded in the
ternary value t. After the second rounding, we have $2^{n-1} \le a \le 2^n$. □

Note that the exponent $e_a$ of the result lies between $e_b$ (the exponent of b,
here we considered the case $e_b = n$) and $e_b + 2$. Thus no underflow can occur
in an addition. The case $e_a = e_b + 2$ can occur only when the destination
precision is less than that of the operands.

 $\circ$           r      s        round($\circ$, r, s)       t
 towards 0     any    any      0                      -
 away from 0   any    any      $r \lor s$                -
 to nearest    0      any      0                      s
 to nearest    1      0        0/1 (even rounding)    +1/-1
 to nearest    1      $\ne 0$     1                      -1

 $\circ$           a mod 2    t       round2($\circ$, a, t)
 any           0          any     $a/2$
 towards 0     1          any     $(a-1)/2$
 away from 0   1          any     $(a+1)/2$
 to nearest    1          0       $2\lfloor (a+1)/4 \rfloor$
 to nearest    1          ±1      $(a+t)/2$

Table 3.2: Rounding rules for addition.




3.2.2 Floating-Point Subtraction 

Floating-point subtraction (of positive operands) is very similar to addition, 
with the difference that cancellation can occur. Consider for example the 
subtraction 6.77823 — 5.98771. The most significant digit of both operands 
disappeared in the result 0.79052. This cancellation can be dramatic, as 
in 6.7782357934 - 6.7782298731 = 0.0000059203, where six digits were can- 
celled. 

Two approaches are possible, assuming n result digits are wanted, and
the exponent difference between the inputs is d:

• subtract the $n - d$ most-significant digits of the smaller operand from
the n most-significant digits of the larger operand. If the result has $n - e$
digits with $e > 0$, restart with $n + e$ digits from the larger operand and
$(n + e) - d$ from the smaller operand;

• alternatively, predict the number e of cancelled digits in the subtraction,
and directly subtract the $(n + e) - d$ most-significant digits of the
smaller operand from the $n + e$ most-significant digits of the larger one.

Note that in the first approach, we might have $e = n$ if all most-significant
digits cancel, thus the process might need to be repeated several times.

The first step in the second approach is usually called leading zero de- 
tection. Note that the number e of cancelled digits might depend on the 
rounding mode. For example, 6.778 — 5.7781 with a 3-digit result yields 0.999 
with rounding toward zero, and 1.00 with rounding to nearest. Therefore, in 
a real implementation, the definition of e has to be made precise. 

In practice we might consider $n + g$ and $(n + g) - d$ digits instead of n
and $n - d$, where the g "guard digits" would prove useful (i) to decide the
final rounding, and/or (ii) to avoid another loop in case $e \le g$.

Sterbenz's Theorem 

Sterbenz's Theorem is an important result concerning floating-point subtrac- 
tion (of operands of the same sign). It states that the rounding error is zero 
in some common cases. More precisely: 

Theorem 3.2.2 (Sterbenz) If x and y are two floating-point numbers of
same precision n, such that y lies in the interval $[x/2, 2x] \cup [2x, x/2]$, then
$y - x$ is exactly representable in precision n, if there is no underflow.

Proof. The case $x = y = 0$ is trivial, so assume that $x \ne 0$. Since
$y \in [x/2, 2x] \cup [2x, x/2]$, x and y must have the same sign. We assume without
loss of generality that x and y are positive, so $y \in [x/2, 2x]$.

Assume $x \le y \le 2x$ (the same reasoning applies for $x/2 \le y \le x$, i.e.,
$y \le x \le 2y$, by interchanging x and y). Since $x \le y$, we have $\mathrm{ulp}(x) \le \mathrm{ulp}(y)$,
thus y is an integer multiple of ulp(x). It follows that $y - x$ is an integer
multiple of ulp(x). Since $0 \le y - x \le x$, $y - x$ is necessarily representable
with the precision of x. □

It is important to note that Sterbenz's Theorem applies for any radix $\beta$; the
constant 2 in $[x/2, 2x]$ has nothing to do with the radix.
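
Sterbenz's Theorem is easy to observe on binary64 numbers. The sketch below compares the machine difference with the exact rational difference; the function name subtraction_is_exact is ours, and the random sample is only an experiment, not a proof.

```python
from fractions import Fraction
import random

def subtraction_is_exact(x, y):
    """True when the floating-point result y - x equals the exact difference."""
    return Fraction(y) - Fraction(x) == Fraction(y - x)

rng = random.Random(0)
pairs = []
for _ in range(1000):
    x = rng.uniform(1e-3, 1e3)
    y = x * rng.uniform(0.5, 2.0)
    if x / 2 <= y <= 2 * x:                 # hypothesis of the theorem
        pairs.append((x, y))

print(all(subtraction_is_exact(x, y) for x, y in pairs))
print(subtraction_is_exact(0.1, 1e16))      # y far outside [x/2, 2x]
```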

3.3 Multiplication 

Multiplication of floating-point numbers is called a short product. This
reflects the fact that, in some cases, the low part of the full product of the
significands has no impact (except perhaps for the rounding) on the final
result. Consider the multiplication $x \times y$, where $x = \ell\beta^e$ and $y = m\beta^f$.
Then $\circ(xy) = \circ(\ell m)\beta^{e+f}$, thus it suffices to consider the case that $x = \ell$
and $y = m$ are integers, and the product is rounded at some weight $\beta^g$ for
$g \ge 0$. Either the integer product $\ell \times m$ is computed exactly, using one
of the algorithms from Chapter 1 and then rounded; or the upper part is
computed directly using a "short product algorithm", with correct rounding.
The different cases that can occur are depicted in Figure 3.1.

An interesting question is: how many consecutive identical bits can occur
after the round bit? Without loss of generality, we can rephrase this question
as follows. Given two odd integers of at most n bits, what is the longest run
of identical bits in their product? (In the case of an even significand, one
might write it $m = \ell 2^e$ with $\ell$ odd.) There is no a priori bound except the
trivial one of $2n - 2$ for the number of zeros, and $2n - 1$ for the number of
ones. For example, with a precision 5 bits, $27 \times 19 = (1\,000\,000\,001)_2$. More
generally, such a case corresponds to a factorisation of $2^{2n-1} + 1$ into two
integers of n bits, for example $258\,513 \times 132\,913 = 2^{35} + 1$. $2n$ consecutive
ones are not possible since $2^{2n} - 1$ can not factor into two integers of at
most n bits. Therefore the maximal runs have $2n - 1$ ones, for example
$217 \times 151 = (111\,111\,111\,111\,111)_2$ for $n = 8$. A larger example is
$849\,583 \times 647\,089 = 2^{39} - 1$.
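
These factorisations are quickly confirmed with exact integer arithmetic. The helper longest_run, which returns the longest run of identical bits in the binary expansion, is a hypothetical name.

```python
from itertools import groupby

def longest_run(value):
    """Length of the longest run of identical bits in bin(value)."""
    return max(len(list(group)) for _, group in groupby(bin(value)[2:]))

examples = [
    (27, 19, 2 ** 9 + 1),
    (258513, 132913, 2 ** 35 + 1),
    (217, 151, 2 ** 15 - 1),
    (849583, 647089, 2 ** 39 - 1),
]
for a, b, expected in examples:
    n = max(a.bit_length(), b.bit_length())
    print(n, a * b == expected, longest_run(a * b))
```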

The exact product of two floating-point numbers $m\beta^e$ and $m'\beta^{e'}$ is
$(mm')\beta^{e+e'}$. Therefore, if no underflow or overflow occurs, the problem
reduces to the multiplication of the significands m and m'. See Algorithm
FPmultiply.

The product at step 1 of FPmultiply is a short product, i.e., a product
whose most significant part only is wanted, as discussed at the start of this
section. In the quadratic range, it can be computed in about half the time of
a full product. In the Karatsuba and Toom-Cook ranges, Mulders' algorithm
can gain 10% to 20%; however, due to carries, implementing this algorithm
for floating-point computations is tricky. In the FFT range, no better
algorithm is known than computing the full product mm' and then rounding
it.

Figure 3.1: Different multiplication scenarios, according to the input and
output precisions. The rectangle corresponds to the full product of the inputs
x and y (most significant digits bottom left), the triangle to the wanted short
product. Case (a): no rounding is necessary, the product being exact; case
(b): the full product needs to be rounded, but the inputs should not be;
case (c): the input x with the larger precision might be truncated before
performing a short product; case (d): both inputs might be truncated.

Hence our advice is to perform a full product of m and m', possibly after
truncating them to $n + g$ digits if they have more than $n + g$ digits. Here g
(the number of guard digits) should be positive (see Exercise 3.4).

It seems wasteful to multiply n-bit operands, producing a 2n-bit product,
only to discard the low-order n bits. Algorithm ShortProduct computes
an approximation to the short product without computing the 2n-bit full
product. It uses a threshold $n_0 \ge 1$, which should be optimized for the given
code base.

Algorithm 3.3 FPmultiply

Input: $x = m \cdot \beta^e$, $x' = m' \cdot \beta^{e'}$, a precision n, a rounding mode $\circ$
Output: $\circ(xx')$ rounded to precision n

1: $m'' \leftarrow \circ(mm')$ rounded to precision n
2: return $m'' \cdot \beta^{e+e'}$.



Error analysis of the short product. Consider two n-word normalised
significands A and B that we multiply using a short product algorithm, where
the notation FullProduct(A, B) means the full integer product $A \cdot B$.

Algorithm 3.4 ShortProduct

Input: integers A, B, and n, with $0 \le A, B < \beta^n$
Output: an approximation to $AB \operatorname{div} \beta^n$
Require: a threshold $n_0$

if $n \le n_0$ then return FullProduct(A, B) div $\beta^n$
choose $k \ge n/2$, $\ell \leftarrow n - k$
$C_1 \leftarrow$ FullProduct(A div $\beta^\ell$, B div $\beta^\ell$) div $\beta^{k-\ell}$
$C_2 \leftarrow$ ShortProduct(A mod $\beta^\ell$, B div $\beta^k$, $\ell$)
$C_3 \leftarrow$ ShortProduct(A div $\beta^k$, B mod $\beta^\ell$, $\ell$)
return $C_1 + C_2 + C_3$.

Theorem 3.3.1 The value $C'$ returned by Algorithm ShortProduct differs
from the exact short product $C = AB \operatorname{div} \beta^n$ by at most $3(n-1)$:

$$C' \le C \le C' + 3(n-1).$$

Proof. First, since A, B are nonnegative, and all roundings are truncations,
the inequality $C' \le C$ follows.

Let $A = \sum_i a_i\beta^i$ and $B = \sum_j b_j\beta^j$, where $0 \le a_i, b_j < \beta$. The
possible errors come from: (i) the neglected $a_i b_j$ terms, i.e., parts $C'_2, C'_3, C_4$ of
Figure 3.2; (ii) the truncation while computing $C_1$; (iii) the error in the
recursive calls for $C_2$ and $C_3$.

Figure 3.2: Graphical view of Algorithm ShortProduct: the computed
parts are $C_1, C_2, C_3$, and the neglected parts are $C'_2, C'_3, C_4$ (most significant
part bottom left).

We first prove that the algorithm accumulates all products $a_i b_j$ with
$i + j \ge n - 1$. This corresponds to all terms on and below the diagonal in
Figure 3.2. The most significant neglected terms are the bottom-left terms
from $C'_2$ and $C'_3$, respectively $a_{\ell-1}b_{k-1}$ and $a_{k-1}b_{\ell-1}$. Their contribution is at
most $2(\beta-1)^2\beta^{n-2}$. The neglected terms from the next diagonal contribute
at most $4(\beta-1)^2\beta^{n-3}$, and so on. The total contribution of neglected terms
is thus bounded by:

$$(\beta-1)^2\beta^n\left[2\beta^{-2} + 4\beta^{-3} + 6\beta^{-4} + \cdots\right] < 2\beta^n$$

(the inequality is strict since the sum is finite).

The truncation error in $C_1$ is at most $\beta^n$, thus the maximal difference
$\varepsilon(n)$ between $C$ and $C'$ satisfies:

$$\varepsilon(n) \le 3 + 2\varepsilon(\lfloor n/2 \rfloor),$$

which gives $\varepsilon(n) \le 3(n-1)$, since $\varepsilon(1) = 0$. □
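
Algorithm ShortProduct and the bound of Theorem 3.3.1 can be exercised with a direct transcription into Python. The choices below are assumptions made for the experiment: radix $\beta = 10$, threshold $n_0 = 1$, and $k = \lceil n/2 \rceil$; Python's built-in integer product stands in for FullProduct.

```python
import random

def short_product(A, B, n, beta=10, n0=1):
    """Approximate (A*B) // beta**n for 0 <= A, B < beta**n."""
    if n <= n0:
        return (A * B) // beta ** n                       # full product, then truncate
    k = (n + 1) // 2                                       # any k >= n/2 is allowed
    l = n - k
    C1 = ((A // beta ** l) * (B // beta ** l)) // beta ** (k - l)
    C2 = short_product(A % beta ** l, B // beta ** k, l, beta, n0)
    C3 = short_product(A // beta ** k, B % beta ** l, l, beta, n0)
    return C1 + C2 + C3

rng = random.Random(1)
n = 9
worst = 0
for _ in range(1000):
    A, B = rng.randrange(10 ** n), rng.randrange(10 ** n)
    exact = (A * B) // 10 ** n
    worst = max(worst, exact - short_product(A, B, n))
print("largest observed error:", worst, "bound:", 3 * (n - 1))
```

On random inputs the observed error is typically well below the worst-case bound $3(n-1)$.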

Remark: if one of the operands was truncated before applying Algorithm
ShortProduct, simply add one unit to the upper bound (the truncated part
is less than 1, thus its product by the other operand is bounded by $\beta^n$).

The complexity $S(n)$ of Algorithm ShortProduct satisfies the recurrence
$S(n) = M(k) + 2S(n - k)$. The optimal choice of k depends on the underlying
multiplication algorithm. Assuming $M(n) \approx n^\alpha$ for $\alpha > 1$ and $k = \gamma n$, we
get

$$S(n) = \frac{\gamma^\alpha}{1 - 2(1 - \gamma)^\alpha}\, M(n),$$

where the optimal value is $\gamma = 1/2$ in the quadratic range, $\gamma \approx 0.694$ in
the Karatsuba range, and $\gamma \approx 0.775$ in the Toom-Cook 3-way range, giving
respectively $S(n) \sim 0.5M(n)$, $S(n) \sim 0.808M(n)$, and $S(n) \sim 0.888M(n)$.
The ratio $S(n)/M(n) \to 1$ as $r \to \infty$ for Toom-Cook r-way. In the FFT
range, Algorithm ShortProduct is not any faster than a full product.
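
The constants quoted for the three ranges can be recovered numerically from the ratio $\gamma^\alpha / (1 - 2(1-\gamma)^\alpha)$, which follows from substituting $S(n) = c\,n^\alpha$ into the recurrence. The sketch below uses a plain grid search over $\gamma \in [1/2, 1)$; the function names are hypothetical.

```python
import math

def short_product_ratio(gamma, alpha):
    """S(n)/M(n) for M(n) ~ n**alpha and k = gamma*n."""
    return gamma ** alpha / (1 - 2 * (1 - gamma) ** alpha)

def best_gamma(alpha, steps=5000):
    """Grid search for the minimising gamma in [1/2, 1)."""
    grid = [0.5 + 0.5 * i / steps for i in range(steps)]
    return min(grid, key=lambda g: short_product_ratio(g, alpha))

for name, alpha in [("quadratic", 2.0),
                    ("Karatsuba", math.log2(3)),
                    ("Toom-Cook 3-way", math.log(5, 3))]:
    g = best_gamma(alpha)
    print(f"{name:16s} gamma = {g:.3f}  S/M = {short_product_ratio(g, alpha):.3f}")
```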

3.3.1 Integer Multiplication via Complex FFT 

To multiply n-bit integers, it may be advantageous to use the Fast Fourier
Transform (FFT for short, see §1.3.4, §2.3). Note that three FFTs give the
cyclic convolution $z = x * y$ defined by

$$z_k = \sum_{0 \le j < N} x_j\, y_{k-j \bmod N} \quad \text{for } 0 \le k < N.$$

In order to use the FFT for integer multiplication, we have to pad the input
vectors with zeros, thus increasing the length of the transform from N to
2N.

FFT algorithms fall into two classes: those using number theoretical
properties (typically working over a finite ring, as in §2.3.3), and those based on
complex floating-point computations. The latter, while not having the best
asymptotic complexity, exhibit good practical behaviour, because they take 
advantage of the efficiency of floating-point hardware. The drawback of the 
complex floating-point FFT (complex FFT for short) is that, being based on 
floating-point computations, it requires a rigorous error analysis. However, 
in some contexts where occasional errors are not disastrous, one may accept 
a small probability of error if this speeds up the computation. For example, 
in the context of integer factorisation, a small probability of error is accept- 
able because the result (a purported factorisation) can easily be checked and 
discarded if incorrect. 
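
A minimal sketch of integer multiplication via the complex FFT follows. It is written in pure Python for clarity rather than speed, uses a textbook recursive radix-2 transform, and splits the operands into unsigned 8-bit pieces (the table later in this section assumes signed components instead). The names fft and multiply are hypothetical, and the final rounding is only justified as long as the accumulated floating-point error stays below 1/2.

```python
import cmath

def fft(a, invert=False):
    """Recursive radix-2 FFT of a list whose length is a power of two."""
    n = len(a)
    if n == 1:
        return a[:]
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = -1 if invert else 1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out

def multiply(x, y, bits=8):
    """Product of nonnegative integers x and y via three FFTs."""
    mask = (1 << bits) - 1
    def digits(v):
        d = []
        while v:
            d.append(v & mask)
            v >>= bits
        return d or [0]
    dx, dy = digits(x), digits(y)
    size = 1
    while size < len(dx) + len(dy):          # zero padding: cyclic = acyclic
        size *= 2
    fx = fft([complex(d) for d in dx] + [0j] * (size - len(dx)))
    fy = fft([complex(d) for d in dy] + [0j] * (size - len(dy)))
    z = fft([p * q for p, q in zip(fx, fy)], invert=True)
    # divide by the transform length, round to the nearest integer, recombine
    return sum(round(c.real / size) << (bits * i) for i, c in enumerate(z))

print(multiply(123456789, 987654321) == 123456789 * 987654321)
```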

The following theorem provides a tight error analysis:

Theorem 3.3.2 The complex FFT allows computation of the cyclic
convolution $z = x * y$ of two vectors of length $N = 2^n$ of complex values such
that

$$\|z' - z\|_\infty \le \|x\| \cdot \|y\| \cdot \left((1+\varepsilon)^{3n}(1+\varepsilon\sqrt{5})^{3n+1}(1+\mu)^{3n} - 1\right), \qquad (3.2)$$

where $\|\cdot\|$ and $\|\cdot\|_\infty$ denote the Euclidean and infinity norms respectively, $\varepsilon$
is such that $|(a \pm b)' - (a \pm b)| \le \varepsilon|a \pm b|$, $|(ab)' - (ab)| \le \varepsilon|ab|$ for all machine
floats a, b. Here $\mu \ge |(w^k)' - (w^k)|$, $0 \le k < N$, $w = e^{2\pi i/N}$, and $(\cdot)'$ refers
to the computed (stored) value of $(\cdot)$ for each expression.

  n    b         m         n    b          m
  1   25        25        11   18      18432
  2   24        48        12   17      34816
  3   23        92        13   17      69632
  4    …         …        14   16     131072
  5    …         …        15   16     262144
  6    …         …        16   15     491520
  7    …         …        17   15     983040
  8    …         …        18   14    1835008
  9   19      4864        19   14    3670016
 10   19      9728        20   13    6815744

Table 3.3: Maximal number b of bits per IEEE 754 double-precision
floating-point number binary64 (53-bit significand), and maximal m for a plain $m \times m$
bit integer product, for a given FFT size $2^n$, with signed components.




For the IEEE 754 double-precision format, with rounding to nearest, we have
$\varepsilon = 2^{-53}$, and if the $w^k$ are correctly rounded, we can take $\mu = \varepsilon/\sqrt{2}$. For a
fixed FFT size $N = 2^n$, the inequality (3.2) enables us to compute a bound
B on the components of x and y that guarantees $\|z' - z\|_\infty < 1/2$. If we
know that the exact result $z \in \mathbb{Z}^N$, this enables us to uniquely round the
components of z' to z. Table 3.3 gives $b = \lg B$, the number of bits that
can be used in a 64-bit floating-point word, if we wish to perform m-bit
multiplication exactly (here $m = 2^{n-1}b$). It is assumed that the FFT is
performed with signed components in $\mathbb{Z} \cap [-2^{b-1}, +2^{b-1})$, see for example
[80, p. 161].

Note that Theorem 3.3.2 is a worst-case result; with rounding to nearest
we expect the error to be smaller due to cancellation (see Exercise 3.9).

Since 64-bit floating-point numbers have bounded precision, we can not
compute arbitrarily large convolutions by this method: the limit is about
$n = 43$. However, this corresponds to vectors of size $N = 2^n = 2^{43} > 10^{12}$,
which is more than enough for practical purposes. (See also Exercise 3.11.)
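
The way such a bound on b can be derived from inequality (3.2) is sketched below. The modelling choices are our assumptions, not necessarily those used to produce Table 3.3: each input vector has $N/2$ nonzero components (the rest being zero padding) of absolute value at most $2^{b-1}$, so that $\|x\| \cdot \|y\| \le (N/2)\,2^{2b-2}$, and the error factor is evaluated with log1p and expm1 to avoid cancellation. The function name max_bits is hypothetical.

```python
import math

def max_bits(n, eps=2.0 ** -53):
    """Largest b such that the bound (3.2) stays below 1/2 for FFT size 2**n."""
    mu = eps / math.sqrt(2)
    log_factor = (3 * n * math.log1p(eps)
                  + (3 * n + 1) * math.log1p(eps * math.sqrt(5))
                  + 3 * n * math.log1p(mu))
    err = math.expm1(log_factor)             # (1+eps)^3n (1+eps*sqrt5)^(3n+1) (1+mu)^3n - 1
    b = 0
    # ||x||.||y|| <= 2**(n-1) * 2**(2b-2) = 2**(n + 2b - 3) under our assumptions
    while 2.0 ** (n + 2 * (b + 1) - 3) * err < 0.5:
        b += 1
    return b

for n in (1, 2, 10, 20):
    b = max_bits(n)
    print(n, b, 2 ** (n - 1) * b)
```

Under these assumptions the computed values agree with the entries of Table 3.3 for the sizes printed above.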

3.3.2 The Middle Product

Given two integers of 2n and n bits respectively, their "middle product"
consists of the middle n bits of their 3n-bit product (see Fig. 3.3). The
middle product might be computed using two short products, one (low) short
product between x and the high part of y, and one (high) short product
between x and the low part of y. However there are algorithms to compute
a $2n \times n$ middle product with the same $\sim M(n)$ complexity as an $n \times n$ full
product (see §3.8).

Figure 3.3: The middle product of x of n bits and y of 2n bits corresponds
to the middle region (most significant bits bottom left).

Several applications benefit from an efficient middle product. One of
these applications is Newton's method (§4.2). Consider, for example, the
reciprocal iteration (§4.2.2): $x_{j+1} = x_j + x_j(1 - x_j y)$. If $x_j$ has n bits, one
has to consider 2n bits from y in order to get 2n accurate bits in $x_{j+1}$. The
product $x_j y$ has 3n bits, but if $x_j$ is accurate to n bits, the n most significant
bits of $x_j y$ cancel with 1, and the n least significant bits can be ignored as
they only contribute noise. Thus, the middle product of $x_j$ and y is exactly
what is needed.

Payne and Hanek Argument Reduction 

Another application of the middle product is Payne and Hanek argument
reduction. Assume $x = m \cdot 2^e$ is a floating-point number with a significand
$0.5 \le m < 1$ of n bits and a large exponent e (say $n = 53$ and $e = 1024$
to fix the ideas). We want to compute $\sin x$ with a precision of n bits. The
classical argument reduction works as follows: first compute $k = \lfloor x/\pi \rceil$, then
compute the reduced argument

$$x' = x - k\pi. \qquad (3.3)$$

About e bits will be cancelled in the subtraction $x - (k\pi)$, thus we need
to compute $k\pi$ with a precision of at least $e + n$ bits to get an accuracy of
at least n bits for $x'$. Of course, this assumes that x is known exactly;
otherwise there is no point in trying to compute $\sin x$. Assuming $1/\pi$ has
been precomputed to precision e, the computation of k costs $M(e, n)$, and
the multiplication $k \times \pi$ costs $M(e, e + n)$, thus the total cost is about $M(e)$
when $e \gg n$.



Figure 3.4: A graphical view of Payne and Hanek algorithm.

The key idea of the Payne and Hanek algorithm is to rewrite Eqn. (3.3)
as

$$x' = \pi\left(\frac{x}{\pi} - k\right). \qquad (3.4)$$

If the significand of x has $n < e$ bits, only about 2n bits from the expansion
of $1/\pi$ will effectively contribute to the n most significant bits of $x'$, namely
the bits of weight $2^{-e-n}$ to $2^{-e+n}$. Let y be the corresponding 2n-bit part
of $1/\pi$. Payne and Hanek's algorithm works as follows: first multiply the
n-bit significand of x by y, keep the n middle bits, and multiply by an n-bit
approximation of $\pi$. The total cost is $\sim (M(2n, n) + M(n))$, or even $\sim 2M(n)$
if the middle product is performed in time $M(n)$, thus independent of e.

3.4 Reciprocal and Division 

As for integer operations (§1.4), one should try as far as possible to trade
floating-point divisions for multiplications, since the cost of a floating-point
multiplication is theoretically smaller than the cost of a division by a constant
factor (usually from 2 to 5, depending on the algorithm used). In practice,
the ratio might not even be constant unless care is taken in implementing
division. Some implementations provide division with cost $\Theta(M(n) \log n)$ or



When several divisions have to be performed with the same divisor, a well-known trick is to first compute the reciprocal of the divisor (§3.4.1); then each division reduces to a multiplication by the reciprocal. A small drawback is that each division incurs two rounding errors (one for the reciprocal and one for multiplication by the reciprocal) instead of one, so we can no longer guarantee a correctly rounded result. For example, in base ten with six digits, 3.0/3.0 might evaluate to 0.999 999 = 3.0 × 0.333 333.
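
The loss of correct rounding is easy to reproduce. The following minimal sketch uses Python's standard `decimal` module with a six-digit context as a stand-in for the base-ten, six-digit format of the example; the helper names `divide_direct` and `divide_via_reciprocal` are illustrative only and do not come from the text.

```python
from decimal import Decimal, getcontext

# Six significant decimal digits; the default rounding mode is round-half-even.
getcontext().prec = 6

def divide_direct(x: Decimal, y: Decimal) -> Decimal:
    """A single division, hence a single rounding."""
    return x / y

def divide_via_reciprocal(x: Decimal, y: Decimal) -> Decimal:
    """Reciprocal first, then a product: two roundings in total."""
    r = Decimal(1) / y      # first rounding
    return x * r            # second rounding

three = Decimal("3.0")
print(divide_direct(three, three))
print(divide_via_reciprocal(three, three))
```

The second call goes through the rounded reciprocal 0.333333, so its result falls one unit in the last place short of the correctly rounded quotient.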

The cases of a single division, or several divisions with a varying divisor, are considered in §3.4.2.



3.4.1 Reciprocal 

Here we describe algorithms that compute an approximate reciprocal of a positive floating-point number $a$, using integer-only operations (see Chapter 1). The integer operations simulate floating-point computations, but all roundings are made explicit. The number $a$ is represented by an integer $A$ of $n$ words in radix $\beta$: $a = \beta^{-n} A$, and we assume $\beta^n/2 \le A$, thus requiring $1/2 \le a < 1$. (This does not cover all cases for $\beta \ge 3$, but if $\beta^{n-1} \le A < \beta^n/2$, multiplying $A$ by some appropriate integer $k < \beta$ will reduce to the case $\beta^n/2 \le A$, then it suffices to multiply the reciprocal of $ka$ by $k$.)

We first perform an error analysis of Newton's method (§4.2) assuming all computations are done with infinite precision, thus neglecting roundoff errors.



Lemma 3.4.1 Let $1/2 \le a < 1$, $\rho = 1/a$, $x > 0$, and $x' = x + x(1 - ax)$. Then:

\[ 0 \le \rho - x' \le \frac{x^2}{\theta^3} (\rho - x)^2, \]

for some $\theta \in [\min(x, \rho), \max(x, \rho)]$.



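Proof. Newton's iteration is based on approximating the function by its tangent. Let $f(t) = a - 1/t$, with $\rho$ the root of $f$. The second-order expansion of $f$ at $t = \rho$ with explicit remainder is:

\[ f(\rho) = f(x) + (\rho - x) f'(x) + \frac{(\rho - x)^2}{2} f''(\theta), \]

for some $\theta \in [\min(x, \rho), \max(x, \rho)]$. Since $f(\rho) = 0$, this simplifies to

\[ \rho = x - \frac{f(x)}{f'(x)} - \frac{(\rho - x)^2}{2} \, \frac{f''(\theta)}{f'(x)}. \tag{3.5} \]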



112 Modern Computer Arithmetic, version 0.5.1 of April 28, 2010 



Substituting $f(t) = a - 1/t$, $f'(t) = 1/t^2$ and $f''(t) = -2/t^3$, it follows that:

\[ \rho = x + x(1 - ax) + \frac{x^2}{\theta^3} (\rho - x)^2, \]

which proves the claim. □

Algorithm ApproximateReciprocal computes an approximate reciprocal. The input $A$ is assumed to be normalised, i.e., $\beta^n/2 \le A < \beta^n$. The output integer $X$ is an approximation to $\beta^{2n}/A$.



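Algorithm 3.5 ApproximateReciprocal

Input: $A = \sum_{i=0}^{n-1} a_i \beta^i$, with $0 \le a_i < \beta$ and $\beta/2 \le a_{n-1}$
Output: $X = \beta^n + \sum_{i=0}^{n-1} x_i \beta^i$ with $0 \le x_i < \beta$

 1: if $n \le 2$ then return $\lceil \beta^{2n}/A \rceil - 1$
 2: $\ell \leftarrow \lfloor (n-1)/2 \rfloor$, $h \leftarrow n - \ell$
 3: $A_h \leftarrow \sum_{i=0}^{h-1} a_{\ell+i} \beta^i$
 4: $X_h \leftarrow$ ApproximateReciprocal($A_h$)
 5: $T \leftarrow A X_h$
 6: while $T \ge \beta^{n+h}$ do
 7:     $(X_h, T) \leftarrow (X_h - 1, T - A)$
 8: $T \leftarrow \beta^{n+h} - T$
 9: $T_m \leftarrow \lfloor T \beta^{-\ell} \rfloor$
10: $U \leftarrow T_m X_h$
11: return $X_h \beta^{\ell} + \lfloor U \beta^{\ell - 2h} \rfloor$.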
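
The steps above can be sketched with Python's built-in arbitrary-precision integers. This is a minimal illustration, not a tuned implementation: the function name `approximate_reciprocal` and its explicit `n` and `beta` parameters are choices made here, and powers of $\beta$ are recomputed naively instead of being handled as word shifts.

```python
def approximate_reciprocal(A: int, n: int, beta: int) -> int:
    """Approximate beta**(2n) / A for beta**n / 2 <= A < beta**n.

    Follows the numbered steps of Algorithm ApproximateReciprocal.
    """
    if n <= 2:
        # Step 1: ceil(beta^(2n) / A) - 1, using exact integer arithmetic.
        return -(-beta ** (2 * n) // A) - 1
    l = (n - 1) // 2                      # step 2
    h = n - l
    A_h = A // beta ** l                  # step 3: upper h words of A
    X_h = approximate_reciprocal(A_h, h, beta)   # step 4
    T = A * X_h                           # step 5
    while T >= beta ** (n + h):           # steps 6-7: force A * X_h < beta^(n+h)
        X_h, T = X_h - 1, T - A
    T = beta ** (n + h) - T               # step 8
    T_m = T // beta ** l                  # step 9: upper part of T
    U = T_m * X_h                         # step 10
    return X_h * beta ** l + U // beta ** (2 * h - l)   # step 11

print(approximate_reciprocal(256, 3, 8))
```

With $\beta = 8$ and $n = 3$, the input $A = 256 = \beta^n/2$ is the power-of-two case, for which the result is $2\beta^n - 1$.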



Lemma 3.4.2 If $\beta$ is a power of two satisfying $\beta \ge 8$, and $\beta^n/2 \le A < \beta^n$, then the output $X$ of Algorithm ApproximateReciprocal satisfies:

\[ AX < \beta^{2n} < A(X + 2). \]

Proof. For $n \le 2$ the algorithm returns $X = \lfloor \beta^{2n}/A \rfloor$, unless $A = \beta^n/2$ when it returns $X = 2\beta^n - 1$. In both cases we have $AX < \beta^{2n} \le A(X + 1)$, thus the lemma holds for $n \le 2$.

Now consider $n \ge 3$. We have $\ell = \lfloor (n-1)/2 \rfloor$ and $h = n - \ell$, thus $n = h + \ell$ and $h > \ell$. The algorithm first computes an approximate reciprocal of the upper $h$ words of $A$, and then updates it to $n$ words using Newton's iteration.

After the recursive call at line 4 we have by induction

\[ A_h X_h < \beta^{2h} < A_h (X_h + 2). \tag{3.6} \]

After the product $T \leftarrow A X_h$ and the while-loop at steps 6-7, we still have $T = A X_h$, where $T$ and $X_h$ may have new values, and in addition $T < \beta^{n+h}$. We also have $\beta^{n+h} < T + 2A$; we prove this by distinguishing two cases. Either we entered the while-loop, then since the value of $T$ decreased by $A$ at each loop, the previous value $T + A$ was necessarily $\ge \beta^{n+h}$. If we did not enter the while-loop, the value of $T$ is still $A X_h$. Multiplying Eqn. (3.6) by $\beta^{\ell}$ gives: $\beta^{n+h} < A_h \beta^{\ell} (X_h + 2) \le A (X_h + 2) = T + 2A$. Thus we have:

\[ T < \beta^{n+h} < T + 2A. \]

It follows that $T > \beta^{n+h} - 2A > \beta^{n+h} - 2\beta^n$. As a consequence, the value of $\beta^{n+h} - T$ computed at step 8 can not exceed $2\beta^n - 1$. The last lines compute the product $T_m X_h$, where $T_m$ is the upper part of $T$, and put its $\ell$ most significant words in the low part $X_\ell$ of the result $X$.

Now let us perform the error analysis. Compared to Lemma 3.4.1, $x$ stands for $X_h \beta^{-h}$, $a$ stands for $A \beta^{-n}$, and $x'$ stands for $X \beta^{-n}$. The while-loop ensures that we start from an approximation $x < 1/a$, i.e., $A X_h < \beta^{n+h}$. Then Lemma 3.4.1 guarantees that $x \le x' \le 1/a$ if $x'$ is computed with infinite precision. Here we have $x \le x'$, since $X = X_h \beta^{\ell} + X_\ell$, where $X_\ell \ge 0$. The only differences compared to infinite precision are:

• the low $\ell$ words from $1 - ax$ (here $T$ at line 8) are neglected, and only its upper part $(1 - ax)_h$ (here $T_m$) is considered;

• the low $2h - \ell$ words from $x(1 - ax)_h$ are neglected.

Those two approximations make the computed value of $x'$ less than or equal to the value which would be computed with infinite precision. Thus, for the computed value $x'$, we have:

\[ x \le x' \le 1/a. \]

From Lemma 3.4.1, the mathematical error is bounded by $x^2 \theta^{-3} (\rho - x)^2 \le 4\beta^{-2h}$, since $x^2 \le \theta^3$ and $|\rho - x| < 2\beta^{-h}$. The truncation from $1 - ax$, which is multiplied by $x < 2$, produces an error $< 2\beta^{-2h}$. Finally, the truncation of $x(1 - ax)_h$ produces an error $< \beta^{-n}$. The final result is thus:

\[ x' \le \rho < x' + 6\beta^{-2h} + \beta^{-n}. \]

Assuming $6\beta^{-2h} \le \beta^{-n}$, which holds as soon as $\beta \ge 6$ since $2h > n$, this simplifies to:

\[ x' \le \rho < x' + 2\beta^{-n}, \]

which gives with $x' = X \beta^{-n}$ and $\rho = \beta^n/A$:

\[ X \le \frac{\beta^{2n}}{A} < X + 2. \]

Since $\beta$ is assumed to be a power of two, equality can hold only when $A$ is itself a power of two, i.e., $A = \beta^n/2$. In this case there is only one value of $X_h$ that is possible for the recursive call, namely $X_h = 2\beta^h - 1$. In this case $T = \beta^{n+h} - \beta^n/2$ before the while-loop, which is not entered. Then $\beta^{n+h} - T = \beta^n/2$, which multiplied by $X_h$ gives (again) $\beta^{n+h} - \beta^n/2$, whose $h$ most significant words are $\beta - 1$. Thus $X_\ell = \beta^{\ell} - 1$, and $X = 2\beta^n - 1$. □

Remark. Lemma 3.4.2 might be extended to the case $\beta^{n-1} \le A < \beta^n$ or to a radix $\beta$ which is not a power of two. However, we prefer to state a restricted result with simple bounds.

Complexity Analysis. Let $I(n)$ be the cost to invert an $n$-word number using Algorithm ApproximateReciprocal. If we neglect the linear costs, we have $I(n) \approx I(n/2) + M(n, n/2) + M(n/2)$, where $M(n, n/2)$ is the cost of an $n \times (n/2)$ product (the product $A X_h$ at step 5) and $M(n/2)$ the cost of an $(n/2) \times (n/2)$ product (the product $T_m X_h$ at step 10). If the $n \times (n/2)$ product is performed via two $(n/2) \times (n/2)$ products, we have $I(n) \approx I(n/2) + 3M(n/2)$, which yields $I(n) \sim M(n)$ in the quadratic range, $\sim 1.5M(n)$ in the Karatsuba range, $\sim 1.704M(n)$ in the Toom-Cook 3-way range, and $\sim 3M(n)$ in the FFT range. In the FFT range, an $n \times (n/2)$ product might be directly computed by three FFTs of length $3n/2$ words, amounting to $\sim M(3n/4)$; in this case the complexity decreases to $\sim 2.5M(n)$ (see the comments at the end of §2.3.3, page 62).
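
The constants quoted above can be checked under a simplifying assumption that is not stated in the text: take $M(n) = n^{\alpha}$ exactly, with $\alpha = 2$, $\log_2 3$, $\log_3 5$ and $1$ for the quadratic, Karatsuba, Toom-Cook 3-way and FFT ranges (logarithmic factors ignored). Substituting $I(n) = c\,M(n)$ into $I(n) = I(n/2) + p\,M(n/2)$ then gives $c = p/(2^{\alpha} - 1)$. The function name below is illustrative.

```python
import math

def reciprocal_cost_constant(alpha: float, products: float = 3.0) -> float:
    """Constant c with I(n) ~ c * M(n), assuming M(n) = n**alpha and
    I(n) = I(n/2) + products * M(n/2)."""
    return products / (2.0 ** alpha - 1.0)

ranges = {
    "quadratic": 2.0,
    "Karatsuba": math.log2(3),
    "Toom-Cook 3-way": math.log(5, 3),
    "FFT (log factors ignored)": 1.0,
}
for name, alpha in ranges.items():
    print(f"{name}: {reciprocal_cost_constant(alpha):.3f}")

# FFT range with the n x (n/2) product costing M(3n/4) = 1.5 M(n/2):
print(f"FFT, direct product: {reciprocal_cost_constant(1.0, products=2.5):.3f}")
```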

The wrap-around trick. We now describe a slight modification of Algorithm ApproximateReciprocal which yields a complexity $2M(n)$. In the product $A X_h$ at step 5, Eqn. (3.6) tells us that the result approaches $\beta^{n+h}$, or more precisely:

\[ \beta^{n+h} - 2\beta^n < A X_h < \beta^{n+h} + 2\beta^n. \tag{3.7} \]

Assume we use an FFT-based algorithm such as the Schönhage-Strassen algorithm that computes products modulo $\beta^m + 1$, for some integer $m \in (n, n + h)$. Let $A X_h = U \beta^m + V$ with $0 \le V < \beta^m$. It follows from Eqn. (3.7) that $U = \beta^{n+h-m}$ or $U = \beta^{n+h-m} - 1$. Let $T = A X_h \bmod (\beta^m + 1)$ be the value computed by the algorithm. We have $T = V - U$ or $T = V - U + (\beta^m + 1)$. It follows that $A X_h = T + U(\beta^m + 1)$ or $A X_h = T + (U - 1)(\beta^m + 1)$. Taking into account the two possible values of $U$, we have

\[ A X_h = T + (\beta^{n+h-m} - \varepsilon)(\beta^m + 1), \]

where $\varepsilon \in \{0, 1, 2\}$. Since $\beta \ge 6$, $\beta^m > 4\beta^n$, thus only one value of $\varepsilon$ yields a value of $A X_h$ in the interval $(\beta^{n+h} - 2\beta^n, \beta^{n+h} + 2\beta^n)$.

Thus, we can replace step 5 in Algorithm ApproximateReciprocal by the following code:

    Compute $T = A X_h \bmod (\beta^m + 1)$ using FFTs with length $m > n$
    $T \leftarrow T + \beta^{n+h} + \beta^{n+h-m}$        ▷ the case $\varepsilon = 0$
    while $T \ge \beta^{n+h} + 2\beta^n$ do
        $T \leftarrow T - (\beta^m + 1)$

Assuming that one can take $m$ close to $n$, the cost of the product $A X_h$ is only about that of three FFTs of length $n$, that is $\sim M(n/2)$.
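
The recovery step can be illustrated with plain integers. In the sketch below the modular product is computed with the `%` operator, purely as a stand-in for an FFT-based product modulo $\beta^m + 1$; the function name and argument order are assumptions made for this example.

```python
def product_via_wraparound(A: int, X_h: int, beta: int, n: int, h: int, m: int) -> int:
    """Recover the full product A * X_h from its residue modulo beta**m + 1,
    assuming beta >= 6, n < m < n + h, and that A * X_h satisfies Eqn. (3.7)."""
    modulus = beta ** m + 1
    T = (A * X_h) % modulus                        # what the wrapped product returns
    T += beta ** (n + h) + beta ** (n + h - m)     # candidate for epsilon = 0
    while T >= beta ** (n + h) + 2 * beta ** n:    # step down to epsilon = 1 or 2
        T -= modulus
    return T

# Example with beta = 16, n = 5, l = 2, h = 3, m = 6.
beta, n, l, h, m = 16, 5, 2, 3, 6
A = 0xABCDE
A_h = A // beta ** l
X_h = -(-beta ** (2 * h) // A_h) - 1               # satisfies Eqn. (3.6)
print(product_via_wraparound(A, X_h, beta, n, h, m) == A * X_h)
```

Only the residue modulo $\beta^m + 1$ is used after the first line of the function, which is the point of the trick: the known high part of the product is added back instead of being computed.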

3.4.2 Division 

In this section we consider the case where the divisor changes between successive operations, so no precomputation involving the divisor can be performed. We first show that the number of consecutive zeros in the result is bounded by the divisor length, then we consider the division algorithm and its complexity. Lemma 3.4.3 analyses the case where the division operands are truncated, because they have a larger precision than desired in the result. Finally we discuss "short division" and the error analysis of Barrett's algorithm.

A floating-point division reduces to an integer division as follows. Assume dividend $a = \ell \cdot \beta^e$ and divisor $d = m \cdot \beta^f$, where $\ell, m$ are integers. Then $a/d = (\ell/m) \beta^{e-f}$. If $k$ bits of the quotient are needed, we first determine a scaling factor $g$ such that $\beta^{k-1} \le |\ell \beta^g / m| < \beta^k$, and we divide $\ell \beta^g$ (truncated if needed) by $m$. The following theorem gives a bound on the number of consecutive zeros after the integer part of the quotient of $\lfloor \ell \beta^g \rfloor$ by $m$.

Theorem 3.4.1 Assume we divide an $m$-digit positive integer by an $n$-digit positive integer in radix $\beta$, with $m \ge n$. Then the quotient is either exact, or its radix $\beta$ expansion admits at most $n - 1$ consecutive zeros or ones after the digit of weight $\beta^0$.

Proof. We first consider consecutive zeros. If the expansion of the quotient $q$ admits $n$ or more consecutive zeros after the binary point, we can write $q = q_1 + \beta^{-n} q_0$, where $q_1$ is an integer and $0 \le q_0 < 1$. If $q_0 = 0$, then the quotient is exact. Otherwise, if $a$ is the dividend and $d$ is the divisor, one should have $a = q_1 d + \beta^{-n} q_0 d$. However, $a$ and $q_1 d$ are integers, and $0 < \beta^{-n} q_0 d < 1$, so $\beta^{-n} q_0 d$ can not be an integer, so we have a contradiction.

For consecutive ones, the proof is similar: write $q = q_1 - \beta^{-n} q_0$, with $0 \le q_0 < 1$. Since $d < \beta^n$, we still have $0 \le \beta^{-n} q_0 d < 1$. □

Algorithm DivideNewton performs the division of two $n$-digit floating-point numbers. The key idea is to approximate the inverse of the divisor to half precision only, at the expense of additional steps. At step 4, MiddleProduct($q_0$, $d$) denotes the middle product of $q_0$ and $d$, i.e., the $n/2$ middle digits of that product. At step 2, $r$ is an approximation to $1/d_1$, and thus to $1/d$, with precision $n/2$ digits. Therefore at step 3, $q_0$ approximates $c/d$ to about $n/2$ digits, and the upper $n/2$ digits of $q_0 d$ at step 4 agree with those of $c$. The value $e$ computed at step 4 thus equals $q_0 d - c$ to precision $n/2$. It follows that $re \approx e/d$ agrees with $q_0 - c/d$ to precision $n/2$; hence the correction term (which is really a Newton correction) added in the last step.

Algorithm 3.6 DivideNewton

Input: $n$-digit floating-point numbers $c$ and $d$, with $n$ even, $d$ normalised
Output: an approximation of $c/d$

1: write $d = d_1 \beta^{n/2} + d_0$ with $0 \le d_1, d_0 < \beta^{n/2}$
2: $r \leftarrow$ ApproximateReciprocal($d_1$, $n/2$)
3: $q_0 \leftarrow cr$ truncated to $n/2$ digits
4: $e \leftarrow$ MiddleProduct($q_0$, $d$)
5: $q \leftarrow q_0 - re$.



In the FFT range, the cost of Algorithm DivideNewton is $\sim 2.5M(n)$: step 2 costs $\sim 2M(n/2) \sim M(n)$ with the wrap-around trick, and steps 3-5 each cost $\sim M(n/2)$, using a fast middle product algorithm for step 4. By way of comparison, if we computed a full precision inverse as in Barrett's algorithm (see below), the cost would be $\sim 3.5M(n)$. (See §3.8 for improved asymptotic bounds on division.)

In the Karatsuba range, Algorithm DivideNewton costs $\sim 1.5M(n)$, and is useful provided the middle product of step 4 is performed with cost $\sim M(n/2)$. In the quadratic range, Algorithm DivideNewton costs $\sim 2M(n)$, and a classical division should be preferred.

When the requested precision for the output is smaller than that of the inputs of a division, one has to truncate the inputs, in order to avoid an unnecessarily expensive computation. Assume for example that we want to divide two numbers of 10,000 bits, with a 10-bit quotient. To apply the following lemma, just replace $\mu$ by an appropriate value such that $A_1$ and $B_1$ have about $2n$ and $n$ digits respectively, where $n$ is the desired number of digits in the quotient; for example we might choose $\mu = \beta^k$ to truncate to $k$ words.

Lemma 3.4.3 Let $A, B, \mu \in \mathbb{N}^*$, $2 \le \mu \le B$. Let $Q = \lfloor A/B \rfloor$, $A_1 = \lfloor A/\mu \rfloor$, $B_1 = \lfloor B/\mu \rfloor$, $Q_1 = \lfloor A_1/B_1 \rfloor$. If $A/B \le 2B_1$, then

\[ Q \le Q_1 \le Q + 2. \]

The condition $A/B \le 2B_1$ is quite natural: it says that the truncated divisor $B_1$ should have essentially at least as many digits as the desired quotient.

Proof. Let $A_1 = Q_1 B_1 + R_1$. We have $A = A_1 \mu + A_0$, $B = B_1 \mu + B_0$, thus

\[ \frac{A}{B} = \frac{A_1 \mu + A_0}{B_1 \mu + B_0} \le \frac{A_1 \mu + A_0}{B_1 \mu} = Q_1 + \frac{R_1 \mu + A_0}{B_1 \mu}. \]

Since $R_1 < B_1$ and $A_0 < \mu$, $R_1 \mu + A_0 < B_1 \mu$, thus $A/B < Q_1 + 1$. Taking the floor of each side proves, since $Q_1$ is an integer, that $Q \le Q_1$.

Now consider the second inequality. For given truncated parts $A_1$ and $B_1$, and thus given $Q_1$, the worst case is when $A$ is minimal, say $A = A_1 \mu$, and $B$ is maximal, say $B = B_1 \mu + (\mu - 1)$. In this case we have:

\[ \left| \frac{A_1}{B_1} - \frac{A}{B} \right| = \left| \frac{A_1}{B_1} - \frac{A_1 \mu}{B_1 \mu + (\mu - 1)} \right| = \left| \frac{A_1 (\mu - 1)}{B_1 (B_1 \mu + \mu - 1)} \right|. \]

The numerator equals $A - A_1 \le A$, and the denominator equals $B_1 B$, thus the difference $A_1/B_1 - A/B$ is bounded by $A/(B_1 B) \le 2$, and so is the difference between $Q$ and $Q_1$. □
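
As an illustration of Lemma 3.4.3, the sketch below divides a 10,010-bit integer by a 10,000-bit integer using only the top bits of each operand. The function name `truncated_quotient` and the particular operands are made up for this example; $\mu = 2^{9988}$ is chosen so that $B_1$ keeps 12 bits, which is enough for the hypothesis $A/B \le 2B_1$ to hold here.

```python
def truncated_quotient(A: int, B: int, mu: int) -> int:
    """Quotient Q1 of the truncated operands A1 = floor(A/mu), B1 = floor(B/mu)."""
    if not 2 <= mu <= B:
        raise ValueError("Lemma 3.4.3 assumes 2 <= mu <= B")
    return (A // mu) // (B // mu)

# Hypothetical operands: A has 10,010 bits and B has 10,000 bits.
A = (1 << 10009) + 12345 * (1 << 5000) + 7
B = (1 << 9999) + 98765 * (1 << 4000) + 3
mu = 1 << 9988

Q = A // B                           # exact quotient (expensive in general)
Q1 = truncated_quotient(A, B, mu)    # only a 22-bit by 12-bit division
print(Q1 - Q)                        # the lemma bounds this difference by 0 and 2
```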
Algorithm ShortDivision is useful in the Karatsuba and Toom-Cook ranges. The key idea is that, when dividing a $2n$-digit number by an $n$-digit number, some work that is necessary for a full $2n$-digit division can be avoided (see Figure 3.5).

Algorithm 3.7 ShortDivision

Input: $0 \le A < \beta^{2n}$, $\beta^n/2 \le B < \beta^n$
Output: an approximation of $A/B$
Require: a threshold $n_0$
1: if $n \le n_0$ then return $\lfloor A/B \rfloor$
2: choose $k \ge n/2$, $\ell \leftarrow n - k$
3: $(A_1, A_0) \leftarrow (A \text{ div } \beta^{2\ell}, A \bmod \beta^{2\ell})$
4: $(B_1, B_0) \leftarrow (B \text{ div } \beta^{\ell}, B \bmod \beta^{\ell})$
5: $(Q_1, R_1) \leftarrow$ DivRem($A_1$, $B_1$)
6: $A' \leftarrow R_1 \beta^{2\ell} + A_0 - Q_1 B_0 \beta^{\ell}$
7: $Q_0 \leftarrow$ ShortDivision($A' \text{ div } \beta^k$, $B \text{ div } \beta^k$)
8: return $Q_1 \beta^{\ell} + Q_0$.



Theorem 3.4.2 The approximate quotient $Q'$ returned by ShortDivision differs at most by $2 \lg n$ from the exact quotient $Q = \lfloor A/B \rfloor$, more precisely:

\[ Q \le Q' \le Q + 2 \lg n. \]

Proof. If $n \le n_0$, $Q = Q'$ so the statement holds. Assume $n > n_0$. We have $A = A_1 \beta^{2\ell} + A_0$ and $B = B_1 \beta^{\ell} + B_0$, thus since $A_1 = Q_1 B_1 + R_1$, $A = (Q_1 B_1 + R_1) \beta^{2\ell} + A_0 = Q_1 B \beta^{\ell} + A'$, with $A' < \beta^{n+\ell}$. Let $A' = A'_1 \beta^k + A'_0$, and $B = B'_1 \beta^k + B'_0$, with $0 \le A'_0, B'_0 < \beta^k$ and $A'_1 < \beta^{2\ell}$. From Lemma 3.4.3, the exact quotient of $A' \text{ div } \beta^k$ by $B \text{ div } \beta^k$ is greater or equal to that of $A'$ by $B$, thus by induction $Q_0 \ge A'/B$. Since $A/B = Q_1 \beta^{\ell} + A'/B$, this proves that $Q' \ge Q$.

Now by induction $Q_0 \le A'_1/B'_1 + 2 \lg \ell$, and $A'_1/B'_1 \le A'/B + 2$ (from Lemma 3.4.3 again, whose hypothesis $A'/B \le 2B'_1$ is satisfied, since $A' < B_1 \beta^{2\ell}$, thus $A'/B \le \beta^{\ell} \le 2B'_1$), so $Q_0 \le A'/B + 2 \lg n$, and $Q' \le A/B + 2 \lg n$. □

Figure 3.5: Divide and conquer short division: a graphical view. Left: with plain multiplication; right: with short multiplication. See also Figure 1.3.

As shown at the right of Figure 3.5, we can use a short product to compute $Q_1 B_0$ at step 6. Indeed, we need only the upper $\ell$ words of $A'$, thus only the upper $\ell$ words of $Q_1 B_0$. The complexity of Algorithm ShortDivision satisfies $D^*(n) = D(k) + M^*(n - k) + D^*(n - k)$ with $k \ge n/2$, where $D(n)$ denotes the cost of a division with remainder, and $M^*(n)$ the cost of a short product. In the Karatsuba range we have $D(n) \sim 2M(n)$, $M^*(n) \sim 0.808M(n)$, and the best possible value of $k$ is $\sim 0.542n$, with corresponding cost $D^*(n) \sim 1.397M(n)$. In the Toom-Cook 3-way range, $k \sim 0.548n$ is optimal, and gives $D^*(n) \sim 1.988M(n)$.
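
The Karatsuba-range figures can be reproduced numerically under an idealised model that is an assumption of this sketch, not a statement from the text: $M(n) = n^{\alpha}$ with $\alpha = \log_2 3$, $k = \kappa n$, and $D^*(n) = c\,M(n)$. The recurrence then gives $c = (2\kappa^{\alpha} + 0.808(1-\kappa)^{\alpha}) / (1 - (1-\kappa)^{\alpha})$, which a simple grid search minimises over $\kappa \in [1/2, 1)$. Function names are illustrative.

```python
import math

def short_division_constant(kappa: float, alpha: float,
                            div_const: float = 2.0,
                            short_mul_const: float = 0.808) -> float:
    """c with D*(n) ~ c * M(n), from D*(n) = D(k) + M*(n-k) + D*(n-k),
    assuming k = kappa * n and M(n) = n**alpha."""
    rest = (1.0 - kappa) ** alpha
    return (div_const * kappa ** alpha + short_mul_const * rest) / (1.0 - rest)

def best_kappa(alpha: float, steps: int = 1000) -> float:
    """Grid search for the cost-minimising kappa in [0.5, 1)."""
    candidates = [0.5 + 0.5 * i / steps for i in range(steps)]
    return min(candidates, key=lambda kap: short_division_constant(kap, alpha))

alpha = math.log2(3)                  # Karatsuba exponent
kappa = best_kappa(alpha)
print(f"kappa ~ {kappa:.3f}, cost ~ {short_division_constant(kappa, alpha):.3f} M(n)")
```

The Toom-Cook 3-way figures are not reproduced here, because the corresponding constants for $D(n)$ and $M^*(n)$ are not given in this section.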

Barrett's floating-point division algorithm 

Here we consider floating-point division using Barrett's algorithm and provide a rigorous error bound (see §2.4.1 for an exact integer version). The algorithm is useful when the same divisor is used several times; otherwise Algorithm DivideNewton is faster (see Exercise 3.13). Assume we want to divide $a$ by $b$ of $n$ bits, each with a quotient of $n$ bits. Barrett's algorithm is as follows:

1. Compute the reciprocal $r$ of $b$ to $n$ bits [rounding to nearest]

2. $q \leftarrow \circ_n(a \times r)$ [rounding to nearest]

The cost of the algorithm in the FFT range is $\sim 3M(n)$: $\sim 2M(n)$ to compute the reciprocal with the wrap-around trick, and $M(n)$ for the product $a \times r$.

Lemma 3.4.4 At step 2 of Barrett's algorithm, we have $|a - bq| \le 3|b|/2$.

Proof. By scaling $a$ and $b$, we can assume that $b$ and $q$ are integers, that $2^{n-1} \le b, q < 2^n$, thus $a < 2^{2n}$. We have $r = 1/b + \varepsilon$ with $|\varepsilon| \le \mathrm{ulp}(2^{-n}/2) = 2^{-2n}$. Also $q = ar + \varepsilon'$ with $|\varepsilon'| \le \mathrm{ulp}(q)/2 = 1/2$ since $q$ has $n$ bits. Thus $q = a(1/b + \varepsilon) + \varepsilon' = a/b + a\varepsilon + \varepsilon'$, and $|bq - a| = |b| |a\varepsilon + \varepsilon'| \le 3|b|/2$. □

As a consequence, $q$ differs by at most one unit in last place from the $n$-bit quotient of $a$ and $b$, rounded to nearest.

Lemma 3.4.4 can be applied as follows: to perform several divisions with a precision of $n$ bits with the same divisor, precompute a reciprocal with $n + g$ bits, and use the above algorithm with a working precision of $n + g$ bits. If the last $g$ bits of $q$ are neither 000...00x nor 111...11x (where x stands for 0 or 1), then rounding $q$ down to $n$ bits will yield $\circ_n(a/b)$ for a directed rounding mode.
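
The two-step scheme and the bound of Lemma 3.4.4 can be simulated with exact rationals. The sketch below is only a model of the arithmetic: `round_nearest` is a hypothetical helper that rounds a positive rational to $n$ significant bits (ties to even), and the operands are scaled as in the proof, with $b$ an $n$-bit integer and the quotient close to an $n$-bit integer.

```python
from fractions import Fraction

def round_nearest(x: Fraction, n: int) -> Fraction:
    """Round a positive rational x to n significant bits, ties to even."""
    e = x.numerator.bit_length() - x.denominator.bit_length()
    while Fraction(2) ** e <= x:          # adjust so that 2^(e-1) <= x < 2^e
        e += 1
    while Fraction(2) ** (e - 1) > x:
        e -= 1
    ulp = Fraction(2) ** (e - n)
    return round(x / ulp) * ulp           # round() on a Fraction rounds half to even

def barrett_divide(a: int, b: int, n: int) -> Fraction:
    """Step 1: r = round(1/b); step 2: q = round(a * r)."""
    r = round_nearest(Fraction(1, b), n)
    return round_nearest(a * r, n)

n = 8
a, b = 200 * 173 + 17, 173                # hypothetical operands, a/b close to 200
q = barrett_divide(a, b, n)
print(q, abs(a - b * q) <= Fraction(3 * b, 2))
```

In a real implementation $r$ would be computed once and reused across divisions, which is where the method pays off.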

Which Algorithm to Use? 

In this section, we described three algorithms to compute x/y: Divide- 
Newton uses Newton's method for 1/y and incorporates the dividend x 
at the last iteration, ShortDivision is a recursive algorithm using division 
with remainder and short products, and Barrett's algorithm assumes we have 
precomputed an approximation to 1/y. When the same divisor y is used 
several times, clearly Barrett's algorithm is better, since each division costs 
only a short product. Otherwise ShortDivision is theoretically faster than 
DivideNewton in the schoolbook and Karatsuba ranges, and taking k = 
n/2 as parameter in ShortDivision is close to optimal. In the FFT range, 
DivideNewton should be preferred. 

3.5 Square Root 

Algorithm FPSqrt computes a floating-point square root, using as subroutine Algorithm SqrtRem (§1.5.1) to determine an integer square root (with remainder). It assumes an integer significand $m$, and a directed rounding mode (see Exercise 3.14 for rounding to nearest).

Algorithm 3.8 FPSqrt

Input: $x = m \cdot 2^e$, a target precision $n$, a directed rounding mode $\circ$
Output: $y = \circ_n(\sqrt{x})$

if $e$ is odd then $(m', f) \leftarrow (2m, e - 1)$ else $(m', f) \leftarrow (m, e)$
define $m' := m_1 2^{2k} + m_0$, $m_1$ integer of $2n$ or $2n - 1$ bits, $0 \le m_0 < 2^{2k}$
$(s, r) \leftarrow$ SqrtRem($m_1$)
if ($\circ$ is round towards zero or down) or ($r = m_0 = 0$)
then return $s \cdot 2^{k + f/2}$ else return $(s + 1) \cdot 2^{k + f/2}$.



Theorem 3.5.1 Algorithm FPSqrt returns the correctly rounded square root of $x$.

Proof. Since $m_1$ has $2n$ or $2n - 1$ bits, $s$ has exactly $n$ bits, and we have $x \ge s^2 2^{2k+f}$, thus $\sqrt{x} \ge s 2^{k+f/2}$. On the other hand, SqrtRem ensures that $r \le 2s$, thus $x 2^{-f} = (s^2 + r) 2^{2k} + m_0 < (s^2 + r + 1) 2^{2k} \le (s + 1)^2 2^{2k}$. Since $y := s \cdot 2^{k+f/2}$ and $y^+ = (s + 1) \cdot 2^{k+f/2}$ are two consecutive $n$-bit floating-point numbers, this concludes the proof. □

Note: in the case $s = 2^n - 1$, $s + 1 = 2^n$ is still representable in $n$ bits.
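
A minimal Python sketch of Algorithm FPSqrt follows. It uses `math.isqrt` (Python 3.8 or later) in place of SqrtRem, returns the result as a pair (significand, exponent), and allows $k$ to be negative when the input significand is shorter than $2n - 1$ bits, in which case $m'$ is shifted left and $m_0 = 0$. The function name, the return convention and the mode strings `"down"` and `"up"` are assumptions of this example.

```python
import math

def fp_sqrt(m: int, e: int, n: int, mode: str) -> tuple[int, int]:
    """Return (s, g) such that s * 2**g is sqrt(m * 2**e) rounded to n bits.

    m must be positive; mode is "down" (same as towards zero here) or "up".
    """
    if e % 2:                                  # make the exponent even
        m, e = 2 * m, e - 1
    k = (m.bit_length() - 2 * n + 1) // 2      # m1 will have 2n or 2n-1 bits
    if k >= 0:
        m1, m0 = m >> (2 * k), m & ((1 << (2 * k)) - 1)
    else:
        m1, m0 = m << (-2 * k), 0              # pad short significands with zeros
    s = math.isqrt(m1)                         # integer square root ...
    r = m1 - s * s                             # ... and its remainder
    if mode == "down" or (r == 0 and m0 == 0):
        return s, k + e // 2
    return s + 1, k + e // 2

s, g = fp_sqrt(2, 0, 10, "down")
print(s, g, s * 2.0 ** g)
```

For $x = 2$ and $n = 10$ the rounded-down significand is 724 with exponent $-9$, i.e. $724/512 = 1.4140625$.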

A different method is to use an initial approximation to the reciprocal square root $a^{-1/2}$ (§3.5.1), see Exercise 3.15. Faster algorithms are mentioned in §3.8.

3.5.1 Reciprocal Square Root 

In this section we describe an algorithm to compute the reciprocal square root $a^{-1/2}$ of a floating-point number $a$, with a rigorous error bound.

Lemma 3.5.1 Let $a, x > 0$, $\rho = a^{-1/2}$, and $x' = x + (x/2)(1 - ax^2)$. Then

\[ 0 \le \rho - x' \le \frac{3x^3}{2\theta^4} (\rho - x)^2, \]

for some $\theta \in [\min(x, \rho), \max(x, \rho)]$.

Proof. The proof is very similar to that of Lemma 3.4.1. Here we use $f(t) = a - 1/t^2$, with $\rho$ the root of $f$. Eqn. (3.5) translates to:

\[ \rho = x + \frac{x}{2}(1 - ax^2) + \frac{3x^3}{2\theta^4} (\rho - x)^2, \]

which proves the Lemma. □

Algorithm 3.9 ApproximateRecSquareRoot

Input: integer $A$ with $\beta^n \le A < 4\beta^n$, $\beta \ge 38$
Output: integer $X$, $\beta^n/2 \le X < \beta^n$ satisfying Lemma 3.5.2

1: if $n \le 2$ then return $\min(\beta^n - 1, \lfloor \beta^n / \sqrt{A \beta^{-n}} \rfloor)$
2: $\ell \leftarrow \lfloor (n-1)/2 \rfloor$, $h \leftarrow n - \ell$
3: $A_h \leftarrow \lfloor A \beta^{-\ell} \rfloor$
4: $X_h \leftarrow$ ApproximateRecSquareRoot($A_h$)
5: $T \leftarrow A X_h^2$
6: $T_h \leftarrow \lfloor T \beta^{-n} \rfloor$
7: $T_\ell \leftarrow \beta^{2h} - T_h$
8: $U \leftarrow T_\ell X_h$
9: return $\min(\beta^n - 1, X_h \beta^{\ell} + \lfloor U \beta^{\ell - 2h}/2 \rceil)$.



Lemma 3.5.2 Provided that $\beta \ge 38$, if $X$ is the value returned by Algorithm ApproximateRecSquareRoot, $a = A \beta^{-n}$, $x = X \beta^{-n}$, then $1/2 \le x < 1$ and

\[ |x - a^{-1/2}| \le 2\beta^{-n}. \]

Proof. We have $1 \le a < 4$. Since $X$ is bounded by $\beta^n - 1$ at lines 1 and 9, we have $x, x_h < 1$, with $x_h = X_h \beta^{-h}$. We prove the statement by induction on $n$. It is true for $n \le 2$. Now assume the value $X_h$ at step 4 satisfies:

\[ |x_h - a_h^{-1/2}| \le 2\beta^{-h}, \]

where $a_h = A_h \beta^{-h}$. We have three sources of error, that we will bound separately:

1. the rounding errors in steps 6 and 9;

2. the mathematical error given by Lemma 3.5.1, which would occur even if all computations were exact;

3. the error coming from the fact we use $A_h$ instead of $A$ in the recursive call at step 4.

At step 5 we have exactly:

\[ t := T \beta^{-n-2h} = a x_h^2, \]

which gives $|t_h - a x_h^2| < \beta^{-2h}$ with $t_h := T_h \beta^{-2h}$, and in turn $|t_\ell - (1 - a x_h^2)| < \beta^{-2h}$ with $t_\ell := T_\ell \beta^{-2h}$. At step 8 it follows $|u - x_h (1 - a x_h^2)| < \beta^{-2h}$, where $u = U \beta^{-3h}$. Thus, finally $|x - [x_h + x_h (1 - a x_h^2)/2]| < (\beta^{-2h} + \beta^{-n})/2$, after taking into account the rounding error in the last step.

Now we apply Lemma 3.5.1 to $x \to x_h$, $x' \to x$, to bound the mathematical error, assuming no rounding error occurs:

\[ 0 \le a^{-1/2} - x \le \frac{3 x_h^3}{2\theta^4} (a^{-1/2} - x_h)^2, \]

which gives¹ $|a^{-1/2} - x| \le 3.04 (a^{-1/2} - x_h)^2$. Now $|a^{-1/2} - a_h^{-1/2}| \le |a - a_h| \nu^{-3/2}/2$ for $\nu \in [\min(a_h, a), \max(a_h, a)]$, thus $|a^{-1/2} - a_h^{-1/2}| \le \beta^{-h}/2$. Together with the induction hypothesis $|x_h - a_h^{-1/2}| \le 2\beta^{-h}$, it follows that $|a^{-1/2} - x_h| \le 2.5\beta^{-h}$. Thus $|a^{-1/2} - x| \le 19\beta^{-2h}$.
The total error is thus bounded by:

\[ |a^{-1/2} - x| \le \frac{3}{2} \beta^{-n} + 19 \beta^{-2h}. \]

Since $2h \ge n + 1$, we see that $19\beta^{-2h} \le \beta^{-n}/2$ for $\beta \ge 38$, and the lemma follows. □

Note: if $A_h X_h^2 \le \beta^{3h}$ at step 4 of Algorithm ApproximateRecSquareRoot, we could have $A X_h^2 > \beta^{n+2h}$ at step 5, which might cause $T_\ell$ to be negative.

Let $R(n)$ be the cost of ApproximateRecSquareRoot for an $n$-digit input. We have $h, \ell \approx n/2$, thus the recursive call costs $R(n/2)$, step 5 costs $M(n/2)$ to compute $X_h^2$, and $M(n)$ for the product $A X_h^2$ (or $M(3n/4)$ in the FFT range using the wrap-around trick described in §3.4.1, since we know the upper $n/2$ digits of the product give 1), and again $M(n/2)$ for step 8. We get $R(n) = R(n/2) + 2M(n)$ (or $R(n/2) + 7M(n)/4$ in the FFT range), which yields $R(n) \sim 4M(n)$ (or $R(n) \sim 3.5M(n)$ in the FFT range).

1 Since $\theta \in [x_h, a^{-1/2}]$ and $|x_h - a^{-1/2}| \le 2.5\beta^{-h}$, we have $\theta \ge x_h - 2.5\beta^{-h}$, thus $x_h/\theta \le 1 + 2.5\beta^{-h}/\theta \le 1 + 5\beta^{-h}$ (remember $\theta \in [x_h, a^{-1/2}]$), and it follows that $\theta \ge 1/2$. For $\beta \ge 38$, since $h \ge 2$, we have $1 + 5\beta^{-h} \le 1.0035$, thus $1.5 x_h^3/\theta^4 \le (1.5/\theta)(1.0035)^3 \le 3.04$.

This algorithm is not optimal in the FFT range, especially when using an FFT algorithm with cheap point-wise products (such as the complex FFT, see §3.3.1). Indeed, Algorithm ApproximateRecSquareRoot uses the following form of Newton's iteration:

\[ x' = x + \frac{x}{2}(1 - ax^2). \]

It might be better to write:

\[ x' = x + \frac{1}{2}(x - ax^3). \]

Here, the product $x^3$ might be computed with a single FFT transform of length $3n/2$, replacing the point-wise products $\hat{x}_i^2$ by $\hat{x}_i^3$, with a total cost $\sim 0.75M(n)$. Moreover, the same idea can be used for the full product $ax^3$ of $5n/2$ bits, whose upper $n/2$ bits match those of $x$. Thus, using the wrap-around trick, a transform of length $2n$ is enough, with a cost of $\sim M(n)$ for the last iteration, and a total cost of $\sim 2M(n)$ for the reciprocal square root. With this improvement, the algorithm of Exercise 3.15 costs only $\sim 2.25M(n)$.

3.6 Conversion

Since most software tools work in radix 2 or $2^k$, and humans usually enter or read floating-point numbers in radix 10 or $10^k$, conversions are needed from one radix to the other one. Most applications perform very few conversions, in comparison to other arithmetic operations, thus the efficiency of the conversions is rarely critical.² The main issue here is therefore more correctness than efficiency. Correctness of floating-point conversions is not an easy task, as can be seen from the history of bugs in Microsoft Excel.³

The algorithms described in this section use as subroutines the integer-conversion algorithms from Chapter 1. As a consequence, their efficiency depends on the efficiency of the integer-conversion algorithms.

2 An important exception is the computation of billions of digits of constants like $\pi$, $\log 2$, where a quadratic conversion routine would be far too slow.

3 In Excel 2007, the product 850 × 77.1 prints as 100,000 instead of 65,535; this is really an output bug, since if one multiplies "100,000" by 2, one gets 131,070. An input bug occurred in Excel 3.0 to 7.0, where the input 1.40737488355328 gave 0.64.

3.6.1 Floating-Point Output

In this section we follow the convention of using lower-case letters for parameters related to the internal radix $b$, and upper-case for parameters related to the external radix $B$. Consider the problem of printing a floating-point number, represented internally in radix $b$ (say $b = 2$) in an external radix $B$ (say $B = 10$). We distinguish here two kinds of floating-point output:

• fixed-format output, where the output precision is given by the user,
and we want the output value to be correctly rounded according to the 
given rounding mode. This is the usual method when values are to 
be used by humans, for example to fill a table of results. The input 
and output precisions may be very different: for example one may 
want to print 1000 digits of 2/3, which uses only one digit internally 
in radix 3. Conversely, one may want to print only a few digits of a 
number accurate to 1000 bits. 

• free-format output, where we want the output value, when read with
correct rounding (usually to nearest), to give exactly the initial number. 
Here the minimal number of printed digits may depend on the input 
number. This kind of output is useful when storing data in a file, while 
guaranteeing that reading the data back will produce exactly the same 
internal numbers, or for exchanging data between different programs. 

In other words, if $x$ is the number that we want to print, and $X$ is the printed value, the fixed-format output requires $|x - X| < \mathrm{ulp}(X)$, and the free-format output requires $|x - X| < \mathrm{ulp}(x)$ for directed rounding. Replace $< \mathrm{ulp}(\cdot)$ by $\le \mathrm{ulp}(\cdot)/2$ for rounding to nearest.



Algorithm 3.10 PrintFixed

Input: $x = f \cdot b^{e-p}$ with $f, e, p$ integers, $b^{p-1} \le |f| < b^p$, external radix $B$ and precision $P$, rounding mode $\circ$
Output: $X = F \cdot B^{E-P}$ with $F, E$ integers, $B^{P-1} \le |F| < B^P$, such that $X = \circ(x)$ in radix $B$ and precision $P$
1: $\lambda \leftarrow \circ(\log b / \log B)$
2: $E \leftarrow 1 + \lfloor (e - 1) \lambda \rfloor$
3: $q \leftarrow \lceil P/\lambda \rceil$
4: $y \leftarrow \circ(x B^{P-E})$ with precision $q$
5: if one can not round $y$ to an integer then increase $q$ and go to step 4
6: $F \leftarrow$ Integer($y$, $\circ$).        ▷ see §1.7
7: if $|F| \ge B^P$ then $E \leftarrow E + 1$ and go to step 4
8: return $F$, $E$.

Some comments on Algorithm PrintFixed:

• it assumes that we have precomputed values of $\lambda_B = \circ(\log b / \log B)$ for any possible external radix $B$ (the internal radix $b$ is assumed to be fixed for a given implementation). Assuming the input exponent $e$ is bounded, it is possible (see Exercise 3.17) to choose these values precisely enough that

\[ E = 1 + \left\lfloor (e - 1) \frac{\log b}{\log B} \right\rfloor, \tag{3.8} \]

thus the value of $\lambda$ at step 1 is simply read from a table;

• the difficult part is step 4, where one has to perform the exponentiation $B^{P-E}$ (remember all computations are done in the internal radix $b$) and multiply the result by $x$. Since we expect an integer of $q$ digits in step 6, there is no need to use a precision of more than $q$ digits in these computations, but a rigorous bound on the rounding errors is required, so as to be able to correctly round $y$;

• in step 5, "one can round $y$ to an integer" means that the interval containing all possible values of $x B^{P-E}$ (including the rounding errors while approaching $x B^{P-E}$, and the error while rounding to precision $q$) contains no rounding boundary (if $\circ$ is a directed rounding, it should contain no integer; if $\circ$ is rounding to nearest, it should contain no half-integer).

Theorem 3.6.1 Algorithm PrintFixed is correct.

Proof. First assume that the algorithm finishes. Eqn. (3.8) implies $B^{E-1} \le b^{e-1}$, thus $|x| B^{P-E} \ge B^{P-1}$, which implies that $|F| \ge B^{P-1}$ at step 6. Thus $B^{P-1} \le |F| < B^P$ at the end of the algorithm. Now, printing $x$ gives $F \cdot B^a$ iff printing $x B^k$ gives $F \cdot B^{a+k}$ for any integer $k$. Thus it suffices to check that printing $x B^{P-E}$ gives $F$, which is clear by construction.

The algorithm terminates because at step 4, $x B^{P-E}$, if not an integer, can not be arbitrarily close to an integer. If $P - E \ge 0$, let $k$ be the number of digits of $B^{P-E}$ in radix $b$, then $x B^{P-E}$ can be represented exactly with $p + k$ digits. If $P - E < 0$, let $g = B^{E-P}$, of $k$ digits in radix $b$. Assume $f/g = n + \varepsilon$ with $n$ integer; then $f - gn = g\varepsilon$. If $\varepsilon$ is not zero, $g\varepsilon$ is a non-zero integer, thus $|\varepsilon| \ge 1/g \ge 2^{-k}$.

The case $|F| \ge B^P$ at step 7 can occur for two reasons: either $|x| B^{P-E} \ge B^P$, thus its rounding also satisfies this inequality; or $|x| B^{P-E} < B^P$, but its rounding equals $B^P$ (this can only occur for rounding away from zero or to nearest). In the former case we have $|x| B^{P-E} \ge B^{P-1}$ at the next pass in step 4, while in the latter case the rounded value $F$ equals $B^{P-1}$ and the algorithm terminates. □

Now consider free-format output. For a directed rounding mode we want $|x - X| < \mathrm{ulp}(x)$ knowing $|x - X| < \mathrm{ulp}(X)$. Similarly for rounding to nearest, if we replace ulp by ulp/2.

It is easy to see that a sufficient condition is that $\mathrm{ulp}(X) \le \mathrm{ulp}(x)$, or equivalently $B^{E-P} \le b^{e-p}$ in Algorithm PrintFixed (with $P$ not fixed at input, which explains the "free-format" name). To summarise, we have

\[ b^{e-1} \le |x| < b^e, \qquad B^{E-1} \le |X| < B^E. \]

Since $|x| < b^e$, and $X$ is the rounding of $x$, it suffices to have $B^{E-1} \le b^e$. It follows that $B^{E-P} \le b^e B^{1-P}$, and the above sufficient condition becomes:

\[ P \ge 1 + p \, \frac{\log b}{\log B}. \]

For example, with $b = 2$ and $B = 10$, $p = 53$ gives $P \ge 17$, and $p = 24$ gives $P \ge 9$. As a consequence, if a double-precision IEEE 754 binary floating-point number is printed with at least 17 significant decimal digits, it can be read back without any discrepancy, assuming input and output are performed with correct rounding to nearest (or directed rounding, with appropriately chosen directions).
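
The bound is easy to evaluate and to try out on IEEE 754 doubles. The sketch below assumes that the platform's float formatting and parsing are correctly rounded, which is the case for CPython's built-in `float`; the function name `min_roundtrip_digits` is illustrative.

```python
import math

def min_roundtrip_digits(p: int, b: int = 2, B: int = 10) -> int:
    """Smallest integer P with P >= 1 + p * log(b) / log(B)."""
    return math.ceil(1 + p * math.log(b) / math.log(B))

print(min_roundtrip_digits(53), min_roundtrip_digits(24))

x = 0.1 + 0.2
print(float(format(x, ".17g")) == x)   # 17 significant digits read back exactly
print(float(format(x, ".16g")) == x)   # 16 digits are not enough for this x
```

The condition is sufficient, not necessary: many doubles survive a round trip with fewer digits, but 17 digits work for all of them.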



3.6.2 Floating-Point Input 

The problem of floating-point input is the following. Given a floating-point number $X$ with a significand of $P$ digits in some radix $B$ (say $B = 10$), a precision $p$ and a given rounding mode, we want to correctly round $X$ to a floating-point number $x$ with $p$ digits in the internal radix $b$ (say $b = 2$).

At first glance, this problem looks very similar to the floating-point output problem, and one might think it suffices to apply Algorithm PrintFixed, simply exchanging $(b, p, e, f)$ and $(B, P, E, F)$. Unfortunately, this is not the case. The difficulty is that, in Algorithm PrintFixed, all arithmetic operations are performed in the internal radix $b$, and we do not have such operations in radix $B$ (see however Exercise 1.37).

3.7 Exercises

Exercise 3.1 In §3.1.5 we described a trick to get the next floating-point number in the direction away from zero. Determine for which IEEE 754 double-precision numbers the trick works.

Exercise 3.2 (Kidder, Boldo) Assume a binary representation. The "round- 
ing to odd" mode [121 1149^ I221| is defined as follows: in case the exact value is 
not representable, it rounds to the unique adjacent number with an odd signif- 
icand. ("Von Neumann rounding" [32] omits the test for the exact value being 
representable or not, and rounds to odd in all nonzero cases.) Note that overflow 
never occurs during rounding to odd. Prove that if y = round(x, p + k, odd) and
z = round(y, p, nearest_even), and k > 1, then z = round(x, p, nearest_even), i.e.,
the double-rounding problem does not occur. 

Exercise 3.3 Show that, if $\sqrt{a}$ is computed using Newton's iteration for $a^{-1/2}$:

$$x' = x + \frac{x}{2} (1 - a x^2)$$

(see §3.5.1), and the identity $\sqrt{a} = a \times a^{-1/2}$, with rounding mode "round towards
zero", then it might never be possible to determine the correctly rounded value of
$\sqrt{a}$, regardless of the number of additional guard digits used in the computation.

Exercise 3.4 How does truncating the operands of a multiplication to n + g
digits (as suggested in §3.3) affect the accuracy of the result? Considering the
cases g = 1 and g > 1 separately, what could happen if the same strategy were
used for subtraction?

Exercise 3.5 Is the bound of Theorem 3.3.1 optimal?

Exercise 3.6 Adapt Mulders' short product algorithm [174] to floating-point
numbers. In case the first rounding fails, can you compute additional digits with- 
out starting again from scratch? 

Exercise 3.7 Show that, if a balanced ternary system is used (radix 3 with digits 
{0, ±1}), then "round to nearest" is equivalent to truncation. 

Exercise 3.8 (Percival) Suppose we compute the product of two complex 
floating-point numbers $z_0 = a_0 + i b_0$ and $z_1 = a_1 + i b_1$ in the following way:
$x_a = \circ(a_0 a_1)$, $x_b = \circ(b_0 b_1)$, $y_a = \circ(a_0 b_1)$, $y_b = \circ(a_1 b_0)$, $z = \circ(x_a - x_b) + i \circ(y_a + y_b)$.
All computations are done in precision n, with rounding to nearest. Compute an
error bound of the form $|z - z_0 z_1| \le c 2^{-n} |z_0 z_1|$. What is the best possible
constant c?
Exercise 3.9 Show that, if $\mu = O(\varepsilon)$ and $n\varepsilon < 1$, the bound in Theorem 3.3.2
simplifies to

$$\|z' - z\|_\infty = O(|x| \cdot |y| \cdot n\varepsilon).$$

If the rounding errors cancel we expect the error in each component of z' to be
$O(|x| \cdot |y| \cdot n^{1/2} \varepsilon)$. The error $\|z' - z\|_\infty$ could be larger since it is a maximum of $N = 2^n$
component errors. Using your favourite implementation of the FFT, compare
the worst-case error bound given by Theorem 3.3.2 with the error $\|z' - z\|_\infty$ that
occurs in practice. 

Exercise 3.10 (Enge) Design an algorithm that correctly rounds the product of 
two complex floating-point numbers with 3 multiplications only. [Hint: assume all 
operands and the result have n-bit significand.] 

Exercise 3.11 Write a computer program to check the entries of Table 3.3 are
correct and optimal, given Theorem 3.3.2.

Exercise 3.12 (Bodrato) Assuming one uses an FFT modulo $\beta^n - 1$ in the
wrap-around trick, how should one modify step 5 of ApproximateReciprocal?

Exercise 3.13 To perform k divisions with the same divisor, which of Algorithm 
DivideNewton and Barrett's algorithm is faster? 

Exercise 3.14 Adapt Algorithm FPSqrt to the rounding to nearest mode. 

Exercise 3.15 Devise an algorithm similar to Algorithm FPSqrt but using Algorithm
ApproximateRecSquareRoot to compute an n/2-bit approximation
to $x^{-1/2}$, and doing one Newton-like correction to return an n-bit approximation
to $x^{1/2}$. In the FFT range, your algorithm should take time ~3M(n) (or better).

Exercise 3.16 Prove that for any n-bit floating-point numbers $(x, y) \ne (0, 0)$,
and if all computations are correctly rounded, with the same rounding mode, the
result of $x/\sqrt{x^2 + y^2}$ lies in [-1, 1], except in a special case. What is this special
case and for what rounding mode does it occur?

Exercise 3.17 Show that the computation of E in Algorithm PrintFixed, step[2l 
is correct — i.e., E = 1 + [(e — 1) log log i?J — as long as there is no integer 
n such that \n/{e — 1) log log 6 — 1| < e, where e is the relative precision when 
computing A: A = log i?/ log 6(1 -|- 9) with \6\ < e. For a fixed range of exponents 
— Cmax < e < Cmax, deducc a working precision e. Application: for 6 = 2, and 
Cmax = 2^^, compute the required precision for 3 < i? < 36. 
Exercise 3.18 (Lefèvre) The IEEE 754-1985 standard required binary to decimal
conversions to be correctly rounded in the range $m \cdot 10^n$ for $|m| \le 10^{17} - 1$ and
$|n| \le 27$ in double precision. Find the hardest-to-print double-precision number
in this range (with rounding to nearest, for example). Write a C program that
outputs double-precision numbers in this range, and compare it to the sprintf
C-language function of your system. Similarly for a conversion from the IEEE
754-2008 binary64 format (significand of 53 bits, $2^{-1074} \le |x| < 2^{1024}$) to the
decimal64 format (significand of 16 decimal digits).

Exercise 3.19 The same question as in Exercise 3.18, but for decimal to binary
conversion, and the atof C-language function. 

3.8 Notes and References 

In her PhD thesis [163, Chapter V], Valérie Ménissier-Morain discusses continued
fractions and redundant representations as alternatives to the classical
non-redundant representation considered here. She also considers [163, Chapter III]
the theory of computable reals, their representation by S-adic numbers, and the
computation of algebraic or transcendental functions. 

Other representations were designed to increase the range of representable 
values; in particular Clenshaw and Olver [70] invented level-index arithmetic, where
for example 2009 is approximated by 3.7075, since 2009 ≈ exp(exp(exp(0.7075))),
and the leading 3 indicates the number of iterated exponentials. The obvious 
drawback is that it is expensive to perform arithmetic operations such as addition 
on numbers in the level-index representation. 

Clenshaw and Olver [69] also introduced the idea of unrestricted algorithm
(meaning no restrictions on the precision or exponent range). Several such algorithms
were described in [48].

Nowadays most computers use radix two, but other choices (for example radix 
16) were popular in the past, before the widespread adoption of the IEEE 754 
standard. A discussion of the best choice of radix is given in [42].

For a general discussion of floating-point addition, rounding modes, the sticky
bit, etc., see Hennessy, Patterson and Goldberg [120, Appendix A.4]. (We refer to
the first edition as later editions may not include the relevant Appendix
by Goldberg.)

The main reference for floating-point arithmetic is the IEEE 754 standard [5],
which defines four binary formats: single precision, single extended (deprecated),
double precision, and double extended. The IEEE 854 standard [72] defines
radix-independent arithmetic, and mainly decimal arithmetic. Both standards were
replaced by the revision of IEEE 754 (approved by the IEEE Standards Committee
on June 12, 2008).

We have not found the source of Theorem 3.1.1; it seems to be "folklore".
The rule regarding the precision of a result given possibly differing precisions of
the operands was considered by Brent [49] and Hull [127].

Floating-point expansions were introduced by Priest [187]. They are mainly
useful for a small number of summands, typically two or three, and when the main 
operations are additions or subtractions. For a larger number of summands the 
combinatorial logic becomes complex, even for addition. Also, except in simple 
cases, it seems difficult to obtain correct rounding with expansions. 

Some good references on error analysis of floating-point algorithms are the
books by Higham [121] and Muller [175]. Older references include Wilkinson's
classics [229, 230].

Collins and Krandick [73], and Lefèvre [154], proposed algorithms for
multiple-precision floating-point addition.

The problem of leading zero anticipation and detection in hardware is classical; 
see [195] for a comparison of different methods. Sterbenz's theorem may be found 
in his book [2TT] . 

The idea of having a "short product" together with correct rounding was studied
by Krandick and Johnson [146]. They attributed the term "short product"
to Knuth. They considered both the schoolbook and the Karatsuba domains.
Algorithms ShortProduct and ShortDivision are due to Mulders [174]. The
problem of consecutive zeros or ones (also called runs of zeros or ones) has
been studied by several authors in the context of computer arithmetic: Iordache
and Matula [129] studied division (Theorem 3.4.1), square root, and reciprocal
square root. Muller and Lang [152] generalised their results to algebraic functions.

The Fast Fourier Transform (FFT) using complex floating-point numbers and
the Schönhage-Strassen algorithm are described in Knuth [143]. Many variations
of the FFT are discussed in the books by Crandall [79, 80]. For further references,
see g231 

Theorem 3.3.2 is from Percival [184]; previous rigorous error analyses of complex
FFT gave very pessimistic bounds. Note that [55] corrects the erroneous proof
given in [184] (see also Exercise 3.8).

The concept of "middle product" for power series is discussed in Hanrot et
al. [111]. Bostan, Lecerf and Schost [40] have shown that it can be seen as a
special case of "Tellegen's principle", and have generalised it to operations other
than multiplication. The link between usual multiplication and the middle product
using trilinear forms was mentioned by Victor Pan [182] for the multiplication
of two complex numbers: "The duality technique enables us to extend any successful
bilinear algorithms to two new ones for the new problems, sometimes quite
different from the original problem ...". David Harvey [115] has shown how to
efficiently implement the middle product for integers. A detailed and comprehensive
description of the Payne and Hanek argument reduction method can be found in
Muller [175].

In this section we drop the "~" that strictly should be included in the complexity
bounds. The 2M(n) reciprocal algorithm of §3.4.1 (with the wrap-around
trick) is due to Schönhage, Grotefeld and Vetter [199]. It can be improved, as
noticed by Dan Bernstein [20]. If we keep the FFT-transform of x, we can save
M(n)/3 (assuming the term-to-term products have negligible cost), which gives
5M(n)/3. Bernstein also proposes a "messy" 3M(n)/2 algorithm [20]. Schönhage's
3M(n)/2 algorithm is simpler [198]. The idea is to write Newton's iteration as
$x' = 2x - ax^2$. If x is accurate to n/2 bits, then $ax^2$ has (in theory) 2n bits, but
we know the upper n/2 bits cancel with x, and we are not interested in the low
n bits. Thus we can perform modular FFTs of size 3n/2, with cost M(3n/4) for
the last iteration, and 1.5M(n) overall. This 1.5M(n) bound for the reciprocal
was improved to 1.444M(n) by Harvey [116]. See also [78] for the roundoff error
analysis when using a floating-point multiplier.

The idea of incorporating the dividend in Algorithm DivideNewton is due
to Karp and Markstein [138], and is usually known as the Karp-Markstein trick;
we already used it in Algorithm ExactDivision in Chapter 1. The asymptotic
complexity 5M(n)/2 of floating-point division can be improved to 5M(n)/3, as
shown by van der Hoeven in [125]. Another well-known method to perform a
floating-point division is Goldschmidt's iteration: starting from a/b, first find c
such that $b_1 = cb$ is close to 1, and $a/b = a_1/b_1$ with $a_1 = ca$. At step k,
assuming $a/b = a_k/b_k$, we multiply both $a_k$ and $b_k$ by $2 - b_k$, giving $a_{k+1}$ and $b_{k+1}$.
The sequence $(b_k)$ converges to 1, and $(a_k)$ converges to a/b. Goldschmidt's
iteration works because, if $b_k = 1 + \varepsilon_k$ with $\varepsilon_k$ small, then $b_{k+1} = (1 + \varepsilon_k)(1 - \varepsilon_k)$
$= 1 - \varepsilon_k^2$. Goldschmidt's iteration admits quadratic convergence like Newton's
method. However, unlike Newton's method, Goldschmidt's iteration is not
self-correcting. Thus, it yields an arbitrary precision division with cost $\Theta(M(n) \log n)$.
For this reason, Goldschmidt's iteration should only be used for small, fixed
precision. A detailed analysis of Goldschmidt's algorithms for division and square root,
and a comparison with Newton's method, is given in Markstein [159].
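
As an illustration of the recurrence just described, here is a small sketch in exact rational arithmetic. It is our own example, not taken from the references: the function name `goldschmidt` and the starting factor c = 1/8 are arbitrary choices. Because the arithmetic is exact, it shows the quadratic convergence of $b_k$ to 1 but not the lack of self-correction, which only appears when each step is rounded.

```python
from fractions import Fraction


def goldschmidt(a: Fraction, b: Fraction, c: Fraction, steps: int):
    """Return the pairs (a_k, b_k), starting from a_1 = c*a and b_1 = c*b."""
    ak, bk = c * a, c * b
    history = [(ak, bk)]
    for _ in range(steps):
        factor = 2 - bk                 # multiply numerator and denominator by 2 - b_k
        ak, bk = ak * factor, bk * factor
        history.append((ak, bk))
    return history


a, b = Fraction(3), Fraction(7)
c = Fraction(1, 8)                      # crude approximation to 1/b, so b_1 = 7/8
history = goldschmidt(a, b, c, 4)
for ak, bk in history:
    print(float(1 - bk), float(a / b - ak))
```

Each step squares the distance of $b_k$ from 1, and the quotient $a_k/b_k$ stays equal to a/b throughout.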

Bernstein [20] obtained faster square root algorithms in the FFT domain, by
caching some Fourier transforms. More precisely, he obtained 11M(n)/6 for the
square root, and 5M(n)/2 for the simultaneous computation of $x^{1/2}$ and $x^{-1/2}$.
The bound for the square root was reduced to 4M(n)/3 by Harvey [116].

Classical floating-point conversion algorithms are due to Steele and White
[208], Gay [103], and Clinger [TT]; most of these authors assume fixed precision.
Cowlishaw maintains an extensive bibliography of conversion to and from decimal
formats (see §5.3). What we call "free-format" output is called "idempotent
conversion" by Kahan [133]; see also Knuth [143, exercise 4.4-18]. Another useful
reference on binary to decimal conversion is Cornea et al. [77].

Bürgisser, Clausen and Shokrollahi [59] is an excellent book on topics such
as lower bounds, fast multiplication of numbers and polynomials, Strassen-like 
algorithms for matrix multiplication, and the tensor rank problem. 

There is a large literature on interval arithmetic, which is outside the scope of 
this chapter. A recent book is Kulisch [150], and a good entry point is the Interval
Computations web page (see Chapter 5).

In this chapter we did not consider complex arithmetic, except where rele- 
vant for its use in the FFT. An algorithm for the complex (floating-point) square 
root, which allows correct rounding, is given in [9T]. See also the comments on
Friedland's algorithm in §4.12.



Chapter 4 



Elementary and Special 
Function Evaluation 

Here we consider various applications of Newton's method, which 
can be used to compute reciprocals, square roots, and more gen- 
erally algebraic and functional inverse functions. We then con- 
sider unrestricted algorithms for computing elementary and spe- 
cial functions. The algorithms of this chapter are presented at a 
higher level than in Chapter 3. A full and detailed analysis of
one special function might be the subject of an entire chapter! 

4.1 Introduction 

This chapter is concerned with algorithms for computing elementary and 
special functions, although the methods apply more generally. First we con- 
sider Newton's method, which is useful for computing inverse functions. For 
example, if we have an algorithm for computing y = ln x, then Newton's
method can be used to compute x = exp y (see §4.2.5). However, Newton's
method has many other applications. In fact we already mentioned Newton's
method in Chapters 1-3, but here we consider it in more detail.

After considering Newton's method, we go on to consider various methods
for computing elementary and special functions. These methods include
power series (§4.4), asymptotic expansions (§4.5), continued fractions
(§4.6), recurrence relations (§4.7), the arithmetic-geometric mean (§4.8),
binary splitting (§4.9), and contour integration (§4.10). The methods that we
consider are unrestricted in the sense that there is no restriction on the
attainable precision; in particular, it is not limited to the precision of IEEE
standard 32-bit or 64-bit floating-point arithmetic. Of course, this depends
on the availability of a suitable software package for performing floating-point
arithmetic on operands of arbitrary precision, as discussed in Chapter 3.

Unless stated explicitly, we do not consider rounding issues in this chapter; 
it is assumed that methods described in Chapter [3] are used. Also, to simplify 
the exposition, we assume a binary radix (β = 2), although most of the
content could be extended to any radix. We recall that n denotes the relative
precision (in bits here) of the desired approximation; if the absolute computed
value is close to 1, then we want an approximation to within $2^{-n}$.

4.2 Newton's Method 

Newton's method is a major tool in arbitrary-precision arithmetic. We have 
already seen it or its p-adic counterpart, namely Hensel lifting, in previous 
chapters (see for example Algorithm ExactDivision in §1.4.5, or the iteration
(2.3) to compute a modular inverse in §2.5). Newton's method is also
useful in small precision: most modern processors only implement addition
and multiplication in hardware; division and square root are microcoded,
using either Newton's method if a fused multiply-add instruction is available,
or the SRT algorithm. See the algorithms to compute a floating-point
reciprocal or reciprocal square root in §3.4.1 and §3.5.1.

This section discusses Newton's method in more detail, in the context of
floating-point computations, for the computation of inverse roots (§4.2.1),
reciprocals (§4.2.2), reciprocal square roots (§4.2.3), formal power series
(§4.2.4), and functional inverses (§4.2.5). We also discuss higher order
Newton-like methods (§4.2.6).

Newton's Method via Linearisation 

Recall that a function f of a real variable is said to have a zero ζ if f(ζ) = 0.
If f is differentiable in a neighbourhood of ζ, and f'(ζ) ≠ 0, then ζ is said
to be a simple zero. Similarly for functions of several real (or complex)
variables. In the case of several variables, ζ is a simple zero if the Jacobian
matrix evaluated at ζ is nonsingular.

Newton's method for approximating a simple zero ζ of f is based on the
idea of making successive linear approximations to f(x) in a neighbourhood
of ζ. Suppose that $x_0$ is an initial approximation, and that f(x) has two
continuous derivatives in the region of interest. From Taylor's theorem,

$$f(\zeta) = f(x_0) + (\zeta - x_0) f'(x_0) + \frac{(\zeta - x_0)^2}{2} f''(\xi) \tag{4.1}$$

for some point ξ in an interval including {ζ, $x_0$}. (Here we use Taylor's theorem
at $x_0$, since this yields a formula in terms of derivatives at $x_0$, which is known,
instead of at ζ, which is unknown. Sometimes, for example in the derivation of
(4.3), it is preferable to use Taylor's theorem at the (unknown) zero ζ.)
Since f(ζ) = 0, we see that

$$x_1 = x_0 - f(x_0)/f'(x_0)$$

is an approximation to ζ, and

$$x_1 - \zeta = O(|x_0 - \zeta|^2).$$

Provided $x_0$ is sufficiently close to ζ, we will have

$$|x_1 - \zeta| \le |x_0 - \zeta|/2 < 1.$$

This motivates the definition of Newton's method as the iteration

$$x_{j+1} = x_j - \frac{f(x_j)}{f'(x_j)}, \quad j = 0, 1, \ldots \tag{4.2}$$

Provided $|x_0 - \zeta|$ is sufficiently small, we expect $x_n$ to converge to ζ. The
order of convergence will be at least two, that is

$$|e_{n+1}| \le K |e_n|^2$$

for some constant K independent of n, where $e_n = x_n - \zeta$ is the error after
n iterations.

A more careful analysis shows that

$$e_{n+1} = \frac{f''(\zeta)}{2 f'(\zeta)} e_n^2 + O(|e_n^3|), \tag{4.3}$$

provided $f \in C^3$ near ζ. Thus, the order of convergence is exactly two if
$f''(\zeta) \ne 0$ and $e_0$ is sufficiently small but nonzero. (Such an iteration is also
said to be quadratically convergent.)
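
The error relation (4.3) can be observed numerically. The sketch below is an illustration of ours in ordinary double precision (the helper `newton` and the test function $f(x) = x^2 - 2$ are our choices); it prints the errors $e_n$ and the ratios $e_{n+1}/e_n^2$, which should approach $f''(\zeta)/(2f'(\zeta)) = 1/(2\sqrt{2})$.

```python
import math


def newton(f, fprime, x0, steps):
    """Return [x_0, x_1, ..., x_steps] for the iteration (4.2)."""
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        xs.append(x - f(x) / fprime(x))
    return xs


zeta = math.sqrt(2.0)
xs = newton(lambda x: x * x - 2.0, lambda x: 2.0 * x, 1.5, 3)
errors = [x - zeta for x in xs]
ratios = [errors[n + 1] / errors[n] ** 2 for n in range(2)]
print(errors)
print(ratios, 1.0 / (2.0 * zeta))
```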
4.2.1 Newton's Method for Inverse Roots

Consider applying Newton's method to the function

$$f(x) = y - x^{-m},$$

where m is a positive integer constant, and (for the moment) y is a positive
constant. Since $f'(x) = m x^{-(m+1)}$, Newton's iteration simplifies to

$$x_{j+1} = x_j + x_j (1 - x_j^m y)/m. \tag{4.4}$$

This iteration converges to $\zeta = y^{-1/m}$ provided the initial approximation $x_0$
is sufficiently close to ζ. It is perhaps surprising that (4.4) does not involve
divisions, except for a division by the integer constant m. In particular, we
can easily compute reciprocals (the case m = 1) and reciprocal square roots
(the case m = 2) by Newton's method. These cases are sufficiently important
that we discuss them separately in the following subsections.
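
A minimal sketch of iteration (4.4) in double precision follows. The function name `inverse_root` and the starting values are our own choices for illustration; a multiple-precision implementation would also increase the working precision from one step to the next.

```python
def inverse_root(y: float, m: int, x0: float, steps: int) -> float:
    """Approximate y**(-1/m) by iteration (4.4); the only division is by m."""
    x = x0
    for _ in range(steps):
        x = x + x * (1.0 - x**m * y) / m
    return x


print(inverse_root(10.0, 3, 0.5, 6), 10.0 ** (-1.0 / 3.0))   # inverse cube root
print(inverse_root(3.0, 1, 0.3, 6), 1.0 / 3.0)               # reciprocal (m = 1)
```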



4.2.2 Newton's Method for Reciprocals 

Taking m = 1 in (4.4), we obtain the iteration

$$x_{j+1} = x_j + x_j (1 - x_j y) \tag{4.5}$$

which we expect to converge to 1/y provided $x_0$ is a sufficiently good
approximation. (See §3.4.1 for a concrete algorithm with error analysis.) To
see what "sufficiently good" means, define

$$u_j = 1 - x_j y.$$

Note that $u_j \to 0$ if and only if $x_j \to 1/y$. Multiplying each side of (4.5)
by y, we get

$$1 - u_{j+1} = (1 - u_j)(1 + u_j),$$

which simplifies to

$$u_{j+1} = u_j^2. \tag{4.6}$$

Thus

$$u_j = (u_0)^{2^j}. \tag{4.7}$$

We see that the iteration converges if and only if $|u_0| < 1$, which (for real $x_0$
and y) is equivalent to the condition $x_0 y \in (0, 2)$. Second-order convergence
is reflected in the double exponential with exponent 2 on the right-hand side
of (4.7).

The iteration (4.5) is sometimes implemented in hardware to compute
reciprocals of floating-point numbers (see §4.12). The sign and exponent of the
floating-point number are easily handled, so we can assume that y ∈ [0.5, 1.0)
(recall we assume a binary radix in this chapter). The initial approximation
$x_0$ is found by table lookup, where the table is indexed by the first few bits of
y. Since the order of convergence is two, the number of correct bits approximately
doubles at each iteration. Thus, we can predict in advance how many
iterations are required. Of course, this assumes that the table is initialised
correctly.
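
The relation (4.7) can be verified exactly with rational arithmetic. This is a small sketch of ours (the name `reciprocal_residuals` and the sample values are arbitrary); it also shows that the residuals grow when $x_0 y$ lies outside (0, 2).

```python
from fractions import Fraction


def reciprocal_residuals(y: Fraction, x0: Fraction, steps: int):
    """Return [u_0, ..., u_steps], where u_j = 1 - x_j*y under iteration (4.5)."""
    x = x0
    residuals = [1 - x * y]
    for _ in range(steps):
        x = x + x * (1 - x * y)
        residuals.append(1 - x * y)
    return residuals


y = Fraction(3, 4)
good = reciprocal_residuals(y, Fraction(5, 4), 4)   # x0*y = 15/16, inside (0, 2)
bad = reciprocal_residuals(y, Fraction(3), 3)       # x0*y = 9/4, outside (0, 2)
print([float(u) for u in good])
print([float(u) for u in bad])
```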

Computational Issues 

At first glance, it seems better to replace Eqn. (4.5) by

$$x_{j+1} = x_j (2 - x_j y), \tag{4.8}$$

which looks simpler. However, although those two forms are mathematically
equivalent, they are not computationally equivalent. Indeed, in Eqn. (4.5),
if $x_j$ approximates 1/y to within n/2 bits, then $1 - x_j y = O(2^{-n/2})$, and the
product of $x_j$ by $1 - x_j y$ might be computed with a precision of only n/2
bits. In the apparently simpler form (4.8), $2 - x_j y = 1 + O(2^{-n/2})$, thus the
product of $x_j$ by $2 - x_j y$ has to be performed with a full precision of n bits,
to get $x_{j+1}$ accurate to within n bits.

As a general rule, it is best to separate the terms of different order in
Newton's iteration, and not try to factor common expressions. For an
exception, see the discussion of Schönhage's 3M(n)/2 reciprocal algorithm in
§3.8.
^In the case of the infamous Pentium fdiv bug [109, 176], a lookup table used for
division was initialised incorrectly, and the division was occasionally inaccurate. In this
case division used the SRT algorithm, but the moral is the same: tables must be initialised
correctly.

4.2.3 Newton's Method for (Reciprocal) Square Roots

Taking m = 2 in (4.4), we obtain the iteration

$$x_{j+1} = x_j + x_j (1 - x_j^2 y)/2, \tag{4.9}$$

which we expect to converge to $y^{-1/2}$ provided $x_0$ is a sufficiently good
approximation.

If we want to compute $y^{1/2}$, we can do this in one multiplication after
first computing $y^{-1/2}$, since

$$y^{1/2} = y \times y^{-1/2}.$$

This method does not involve any divisions (except by 2, see Ex. 3.15). In
contrast, if we apply Newton's method to the function $f(x) = x^2 - y$, we
obtain Heron's iteration (see Algorithm SqrtInt in §1.5.1) for the square
root of y:

$$x_{j+1} = \frac{1}{2} \left( x_j + \frac{y}{x_j} \right). \tag{4.10}$$

This requires a division by $x_j$ at iteration j, so it is essentially different from
the iteration (4.9). Although both iterations have second-order convergence,
we expect (4.9) to be more efficient (however this depends on the relative
cost of division compared to multiplication). See also §3.5.1 and, for various
optimisations, §3.8.
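
The two routes to the square root can be compared side by side. The following double-precision sketch is our own illustration (function names and starting values are assumptions); it does not model the cost difference between multiplication and division, only the structure of the two iterations.

```python
import math


def sqrt_via_rsqrt(y: float, x0: float, steps: int) -> float:
    """Iterate (4.9) towards y**(-1/2), then multiply once by y."""
    x = x0
    for _ in range(steps):
        x = x + x * (1.0 - x * x * y) / 2.0    # no division except by 2
    return y * x


def sqrt_heron(y: float, x0: float, steps: int) -> float:
    """Heron's iteration (4.10); one division by x_j per step."""
    x = x0
    for _ in range(steps):
        x = 0.5 * (x + y / x)
    return x


print(sqrt_via_rsqrt(2.0, 0.7, 5), sqrt_heron(2.0, 1.5, 5), math.sqrt(2.0))
```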



4.2.4 Newton's Method for Formal Power Series 

This section is not required for function evaluation, however it gives a
complementary point of view on Newton's method, and has applications to
computing constants such as Bernoulli numbers (see Exercises 4.41-4.42).

Newton's method can be applied to find roots of functions defined by
formal power series as well as of functions of a real or complex variable. For
simplicity we consider formal power series of the form

$$A(z) = a_0 + a_1 z + a_2 z^2 + \cdots$$

where $a_j \in \mathbb{R}$ (or any field of characteristic zero) and ord(A) = 0, i.e., $a_0 \ne 0$.

For example, if we replace y in (4.5) by 1 - z, and take initial approximation
$x_0 = 1$, we obtain a quadratically-convergent iteration for the formal
power series

$$(1 - z)^{-1} = \sum_{n=0}^{\infty} z^n.$$



^Heron of Alexandria, circa 10-75 AD.
In the case of formal power series, "quadratically convergent" means that
$\mathrm{ord}(e_j) \to +\infty$ like $2^j$, where $e_j$ is the difference between the desired result
and the jth approximation. In our example, with the notation of §4.2.2,
$u_0 = 1 - x_0 y = z$, so $u_j = z^{2^j}$ and

$$x_j = \frac{1 - u_j}{1 - z} = \frac{1}{1 - z} + O(z^{2^j}).$$

Given a formal power series $A(z) = \sum_{j \ge 0} a_j z^j$, we can define the formal
derivative

$$A'(z) = \sum_{j > 0} j a_j z^{j-1} = a_1 + 2 a_2 z + 3 a_3 z^2 + \cdots,$$

but there is no useful analogue for multiple-precision integers $\sum_{j=0}^{n} a_j \beta^j$.
This means that some fast algorithms for operations on power series have no
analogue for operations on integers (see for example Exercise 4.1).

4.2.5 Newton's Method for Functional Inverses

Given a function g(x), its functional inverse h(x) satisfies g(h(x)) = x, and
is denoted by $h(x) := g^{(-1)}(x)$. For example, g(x) = ln x and h(x) = exp x
are functional inverses, as are g(x) = tan x and h(x) = arctan x. Using the
function f(x) = y - g(x) in (4.2), one gets a root ζ of f, i.e., a value such
that g(ζ) = y, or $\zeta = g^{(-1)}(y)$:

$$x_{j+1} = x_j + \frac{y - g(x_j)}{g'(x_j)}.$$

Since this iteration only involves g and g', it provides an efficient way to
evaluate h(y), assuming that $g(x_j)$ and $g'(x_j)$ can be efficiently computed.
Moreover, if the complexity of evaluating g' (and of division) is no greater
than that of g, we get a means to evaluate the functional inverse h of g with
the same order of complexity as that of g.

As an example, if one has an efficient implementation of the logarithm,
a similarly efficient implementation of the exponential is deduced as follows.
Consider the root of the function f(x) = y - ln x, which yields the iteration:

$$x_{j+1} = x_j + x_j (y - \ln x_j), \tag{4.11}$$
and in turn Algorithm LiftExp (for the sake of simplicity, we consider here
only one Newton iteration).



Algorithm 4.1 LiftExp

Input: x_j, (n/2)-bit approximation to exp(y)

Output: x_{j+1}, n-bit approximation to exp(y)
    t ← ln x_j          ▷ t computed to n-bit accuracy
    u ← y - t           ▷ u computed to (n/2)-bit accuracy
    v ← x_j u           ▷ v computed to (n/2)-bit accuracy
    x_{j+1} ← x_j + v.
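
The following sketch mimics Algorithm LiftExp with Python's standard `decimal` module. It is an illustration under our own assumptions: precision is counted in decimal digits rather than bits, the function name `lift_exp` and the two guard digits are our choices, and the library's `ln` stands in for an efficient logarithm.

```python
from decimal import Decimal, localcontext


def lift_exp(xj: Decimal, y: Decimal, digits: int) -> Decimal:
    """One Newton step (4.11): from about digits/2 to about digits correct digits."""
    with localcontext() as ctx:
        ctx.prec = digits + 2           # full precision plus guard digits (our choice)
        t = xj.ln()                     # t computed to full accuracy
        u = y - t                       # small: its leading digits have cancelled
        ctx.prec = digits // 2 + 2      # the correction only needs half precision
        v = xj * u
        ctx.prec = digits + 2
        return xj + v


y = Decimal(1)
x0 = Decimal("2.71828")                 # about 6 correct digits of exp(1)
x1 = lift_exp(x0, y, 12)
x2 = lift_exp(x1, y, 24)
print(x0, x1, x2)
```

Only the logarithm and the final addition are carried out at full precision, which is the point of separating the terms of different order.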



4.2.6 Higher Order Newton-like Methods 

The classical Newton's method is based on a linear approximation of f(x)
near $x_0$. If we use a higher-order approximation, we can get a higher-order
method. Consider for example a second-order approximation. Equation (4.1)
becomes:

$$f(\zeta) = f(x_0) + (\zeta - x_0) f'(x_0) + \frac{(\zeta - x_0)^2}{2} f''(x_0) + \frac{(\zeta - x_0)^3}{6} f'''(\xi).$$

Since f(ζ) = 0, we have

$$\zeta = x_0 - \frac{f(x_0)}{f'(x_0)} - \frac{(\zeta - x_0)^2}{2} \frac{f''(x_0)}{f'(x_0)} + O((\zeta - x_0)^3). \tag{4.12}$$

A difficulty here is that the right-hand side of (4.12) involves the unknown ζ.
Let $\zeta = x_0 - f(x_0)/f'(x_0) + \nu$, where ν is a second-order term. Substituting
this in the right-hand side of (4.12) and neglecting terms of order $(\zeta - x_0)^3$
yields the cubic iteration:

$$x_{j+1} = x_j - \frac{f(x_j)}{f'(x_j)} - \frac{f(x_j)^2 f''(x_j)}{2 f'(x_j)^3}.$$

For the computation of the reciprocal (§4.2.2) with f(x) = y - 1/x, this
yields

$$x_{j+1} = x_j + x_j (1 - x_j y) + x_j (1 - x_j y)^2. \tag{4.13}$$
For the computation of exp y using functional inversion (§4.2.5), one gets:

$$x_{j+1} = x_j + x_j (y - \ln x_j) + \frac{1}{2} x_j (y - \ln x_j)^2. \tag{4.14}$$

These iterations can be obtained in a more systematic way that generalises
to give iterations of arbitrarily high order. For the computation of the
reciprocal, let $\varepsilon_j = 1 - x_j y$, so $x_j y = 1 - \varepsilon_j$ and (assuming $|\varepsilon_j| < 1$),

$$1/y = x_j/(1 - \varepsilon_j) = x_j (1 + \varepsilon_j + \varepsilon_j^2 + \cdots).$$

Truncating after the term $\varepsilon_j^{k-1}$ gives a k-th order iteration

$$x_{j+1} = x_j (1 + \varepsilon_j + \varepsilon_j^2 + \cdots + \varepsilon_j^{k-1}) \tag{4.15}$$

for the reciprocal. The case k = 2 corresponds to Newton's method, and the
case k = 3 is just the iteration (4.13) that we derived above.

Similarly, for the exponential we take $\varepsilon_j = y - \ln x_j = \ln(x/x_j)$, so

$$x/x_j = \exp \varepsilon_j = \sum_{m=0}^{\infty} \frac{\varepsilon_j^m}{m!}.$$

Truncating after k terms gives a k-th order iteration

$$x_{j+1} = x_j \left( \sum_{m=0}^{k-1} \frac{\varepsilon_j^m}{m!} \right) \tag{4.16}$$

for the exponential function. The case k = 2 corresponds to the Newton
iteration, the case k = 3 is the iteration (4.14) that we derived above, and
the cases k > 3 give higher-order Newton-like iterations. For a generalisation
to other functions, see Exercises 4.3, 4.6.
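
For the reciprocal, the order of iteration (4.15) can be checked exactly: multiplying (4.15) by y gives $1 - \varepsilon_{j+1} = (1 - \varepsilon_j)(1 + \varepsilon_j + \cdots + \varepsilon_j^{k-1}) = 1 - \varepsilon_j^k$. The sketch below is our own illustration in rational arithmetic (the name `reciprocal_step` and the sample values are arbitrary).

```python
from fractions import Fraction


def reciprocal_step(x: Fraction, y: Fraction, k: int) -> Fraction:
    """One step of the k-th order iteration (4.15) for 1/y."""
    eps = 1 - x * y
    total, power = Fraction(0), Fraction(1)
    for _ in range(k):                  # 1 + eps + eps^2 + ... + eps^(k-1)
        total += power
        power *= eps
    return x * total


y, x = Fraction(3, 4), Fraction(5, 4)   # eps = 1 - 15/16 = 1/16
for k in (2, 3, 4):
    print(k, 1 - reciprocal_step(x, y, k) * y)
```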



4.3 Argument Reduction 

Argument reduction is a classical method to improve the efficiency of the 
evaluation of mathematical functions. The key idea is to reduce the initial 
problem to a domain where the function is easier to evaluate. More precisely, 
given / to evaluate at x, one proceeds in three steps: 
• argument reduction: x is transformed into a reduced argument x';

• evaluation: f is evaluated at x';

• reconstruction: f(x) is computed from f(x') using a functional identity.

In some cases the argument reduction or the reconstruction is trivial, for
example x' = x/2 in radix 2, or f(x) = ±f(x') (some examples illustrate this
below). It might also be that the evaluation step uses a different function g
instead of f; for example sin(x + π/2) = cos(x).

Unfortunately, argument reduction formulae do not exist for every function;
for example, no argument reduction is known for the error function.
Argument reduction is only possible when a functional identity relates f(x)
and f(x') (or g(x) and g(x')). The elementary functions have addition
formulae such as

$$\exp(x + y) = \exp(x) \exp(y),$$
$$\log(xy) = \log(x) + \log(y),$$
$$\sin(x + y) = \sin(x) \cos(y) + \cos(x) \sin(y),$$
$$\tan(x + y) = \frac{\tan(x) + \tan(y)}{1 - \tan(x) \tan(y)}. \tag{4.17}$$

We use these formulae to reduce the argument so that power series converge
more rapidly. Usually we take x = y to get doubling formulae such as

$$\exp(2x) = \exp(x)^2, \tag{4.18}$$

though occasionally tripling formulae such as

$$\sin(3x) = 3 \sin(x) - 4 \sin^3(x)$$

might be useful. This tripling formula only involves one function (sin),
whereas the doubling formula sin(2x) = 2 sin x cos x involves two functions
(sin and cos), but this problem can be overcome: see §4.3.4 and §4.9.1.

We usually distinguish two kinds of argument reduction: 

• additive argument reduction, where x' = x - kc, for some real constant
c and some integer k. This occurs in particular when f(x) is periodic,
for example for the sine and cosine functions with c = 2π;

• multiplicative argument reduction, where $x' = x/c^k$ for some real
constant c and some integer k. This occurs with c = 2 in the computation
of exp x when using the doubling formula (4.18): see §4.3.1.

Note that, for a given function, both kinds of argument reduction might be
available. For example, for sin x, one might either use the tripling formula
$\sin(3x) = 3 \sin x - 4 \sin^3 x$, or the additive reduction sin(x + 2kπ) = sin x
that arises from the periodicity of sin.
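
The three steps (reduction, evaluation, reconstruction) can be sketched for sin x with the tripling formula. This is an illustrative double-precision example of ours; the function name `sin_by_tripling`, the three-term Taylor sum and the choice k = 5 are assumptions made for the sketch.

```python
import math


def sin_by_tripling(x: float, k: int) -> float:
    """sin(x) from sin(x/3^k) and k uses of sin(3t) = 3 sin t - 4 sin^3 t."""
    r = x / 3**k                        # argument reduction (multiplicative, c = 3)
    s = r - r**3 / 6 + r**5 / 120       # evaluation: short Taylor sum near 0
    for _ in range(k):                  # reconstruction via the tripling formula
        s = 3 * s - 4 * s**3
    return s


print(sin_by_tripling(1.0, 5), math.sin(1.0))
```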

Sometimes "reduction" is not quite the right word, since a functional identity
is used to increase rather than to decrease the argument. For example,
the Gamma function Γ(x) satisfies an identity

$$x \Gamma(x) = \Gamma(x + 1),$$

that can be used repeatedly to increase the argument until we reach the region
where Stirling's asymptotic expansion is sufficiently accurate, see §4.5.

4.3.1 Repeated Use of a Doubling Formula 

If we apply the doubling formula (4.18) for the exponential function k times,
we get

$$\exp(x) = \exp(x/2^k)^{2^k}.$$

Thus, if |x| = Θ(1), we can reduce the problem of evaluating exp(x) to that
of evaluating $\exp(x/2^k)$, where the argument is now $O(2^{-k})$. This is better
since the power series converges more quickly for $x/2^k$. The cost is the k
squarings that we need to reconstruct the final result from $\exp(x/2^k)$.

There is a trade-off here, and k should be chosen to minimise the total
time. If the obvious method for power series evaluation is used, then the
optimal k is of order $\sqrt{n}$ and the overall time is $O(n^{1/2} M(n))$. We shall see
in §4.4.3 that there are faster ways to evaluate power series, so this is not
the best possible result.

We assumed here that |a;| = 0(1). A more careful analysis shows that 
the optimal k depends on the order of magnitude of x (see Exercise 14. 5p . 
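
The trade-off can be explored with a small experiment. The following is a minimal sketch, not the book's implementation: the function name `exp_doubling`, the choice of decimal arithmetic, and the fixed allowance of 10 guard digits are all assumptions made for illustration.

```python
from decimal import Decimal, localcontext

def exp_doubling(x, digits=30, k=8):
    """Approximate exp(x): reduce to x/2^k, sum the Taylor series,
    then square k times. Returns (value, number of series terms used)."""
    with localcontext() as ctx:
        ctx.prec = digits + 10                 # assumed-ample guard digits
        r = Decimal(x) / (2 ** k)              # reduced argument x / 2^k
        tol = Decimal(10) ** (-(digits + 5))
        term = total = Decimal(1)              # term for j = 0
        terms_used = 1
        while abs(term) > tol:                 # forward summation of r^j / j!
            term = term * r / terms_used
            total += term
            terms_used += 1
        for _ in range(k):                     # reconstruction: k squarings
            total = total * total
        return total, terms_used

value, used = exp_doubling(1, k=8)
```

Calling `exp_doubling(1, k=0)` and `exp_doubling(1, k=8)` shows the series needing fewer terms as `k` grows, at the price of `k` extra squarings.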

4.3.2 Loss of Precision 

For some power series, especially those with alternating signs, a loss of
precision might occur due to a cancellation between successive terms. A typical
example is the series for $\exp(x)$ when $x < 0$. Assume for example that we
want 10 significant digits of $\exp(-10)$. The first ten terms $x^k/k!$ for $x = -10$
are approximately:

1., -10., 50., -166.6666667, 416.6666667, -833.3333333, 1388.888889, 
-1984.126984, 2480.158730, -2755.731922. 

Note that these terms alternate in sign and initially increase in magnitude. 
They only start to decrease in magnitude for $k > |x|$. If we add the first 51
terms with a working precision of 10 decimal digits, we get an approximation
to $\exp(-10)$ that is only accurate to about 3 digits!
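
This effect is easy to reproduce. The sketch below is an illustration under assumptions (the helper name `exp_series` and the use of Python's `decimal` module are not from the text): it sums 51 terms in 10-digit decimal arithmetic, once directly for $x = -10$ and once for $x = 10$ followed by a reciprocal.

```python
from decimal import Decimal, localcontext
import math

def exp_series(x, terms=51, digits=10):
    """Sum the first `terms` terms of sum x^k / k! using `digits`-digit arithmetic."""
    with localcontext() as ctx:
        ctx.prec = digits
        x = Decimal(x)
        term = total = Decimal(1)          # k = 0 term
        for k in range(1, terms):
            term = term * x / k            # x^k / k! from the previous term
            total += term
        return total

direct = exp_series(-10)                   # alternating terms: heavy cancellation
via_reciprocal = 1 / exp_series(10)        # positive terms only, then invert

exact = math.exp(-10)
err_direct = abs(float(direct) - exact) / exact
err_reciprocal = abs(float(via_reciprocal) - exact) / exact
```

The relative error of `direct` is many orders of magnitude larger than that of `via_reciprocal`, even though both sums use the same working precision.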

A much better approach is to use the identity 

$$\exp(x) = 1/\exp(-x)$$

to avoid cancellation in the power series summation. In other cases a different
power series without sign changes might exist for a closely related function:
for example, compare the series (4.22) and (4.23) for computation of the error
function $\operatorname{erf}(x)$. See also Exercises 4.19–4.20.

4.3.3 Guard Digits 

Guard digits are digits in excess of the number of digits that are required in 
the final answer. Generally, it is necessary to use some guard digits during 
a computation in order to obtain an accurate result (one that is correctly 
rounded or differs from the correctly rounded result by a small number of 
units in the last place). Of course, it is expensive to use too many guard 
digits. Thus, care has to be taken to use the right number of guard digits, 
that is the right working precision. Here and below, we use the generic term
"guard digits", even for radix $\beta = 2$.

Consider once again the example of $\exp x$, with reduced argument $x/2^k$
and $x = O(1)$. Since $x/2^k$ is $O(2^{-k})$, when we sum the power series
$1 + x/2^k + \cdots$ from left to right (forward summation), we "lose" about $k$ bits
of precision. More precisely, if $x/2^k$ is accurate to $n$ bits, then $1 + x/2^k$ is
accurate to $n + k$ bits, but if we use the same working precision $n$, we obtain
only $n$ correct bits. After squaring $k$ times in the reconstruction step, about
$k$ bits will be lost (each squaring loses about one bit), so the final accuracy
will be only $n - k$ bits. If we summed the power series in reverse order instead
(backward summation), and used a working precision of $n + k$ when adding
$1$ and $x/2^k + \cdots$ and during the squarings, we would obtain an accuracy of
$n + k$ bits before the $k$ squarings, and an accuracy of $n$ bits in the final result.

Another way to avoid loss of precision is to evaluate $\operatorname{expm1}(x/2^k)$, where
the function expm1 is defined by

$$\operatorname{expm1}(x) = \exp(x) - 1$$

and has a doubling formula that avoids loss of significance when $|x|$ is small.
See Exercises OHOl 
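
The text does not spell out the doubling formula for expm1. One identity that serves, assumed here for illustration, is $\operatorname{expm1}(2t) = \operatorname{expm1}(t)\,(2 + \operatorname{expm1}(t))$, which follows from $e^{2t} - 1 = (e^t - 1)(e^t + 1)$. The sketch below compares it in double precision with squaring $\exp(x/2^k)$ and subtracting 1 at the end; both function names are hypothetical.

```python
import math

def expm1_by_doubling(x, k=20, terms=6):
    """expm1(x) from expm1(x/2^k), doubled k times with E(2t) = E(t)*(2 + E(t))."""
    r = x / 2.0 ** k
    e, term = 0.0, 1.0
    for j in range(1, terms + 1):      # Taylor series r + r^2/2! + ... (no leading 1)
        term *= r / j
        e += term
    for _ in range(k):
        e = e * (2.0 + e)
    return e

def exp_minus_one_naive(x, k=20, terms=6):
    """Same reduction, but carrying exp(x/2^k) itself and subtracting 1 last."""
    r = x / 2.0 ** k
    s, term = 1.0, 1.0
    for j in range(1, terms + 1):
        term *= r / j
        s += term                      # the tiny r is absorbed next to 1: bits are lost
    for _ in range(k):
        s *= s
    return s - 1.0

x = 1e-3
ref = math.expm1(x)
err_doubling = abs(expm1_by_doubling(x) - ref) / ref
err_naive = abs(exp_minus_one_naive(x) - ref) / ref
```

In this experiment the expm1-based reduction keeps nearly full double precision, while the naive variant loses roughly the $k$ bits predicted above.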

4.3.4 Doubling versus Tripling 

Suppose we want to compute the function $\sinh(x) = (e^x - e^{-x})/2$. The
obvious doubling formula for sinh,

$$\sinh(2x) = 2 \sinh(x) \cosh(x),$$

involves the auxiliary function $\cosh(x) = (e^x + e^{-x})/2$. Since
$\cosh^2(x) - \sinh^2(x) = 1$, we could use the doubling formula

$$\sinh(2x) = 2 \sinh(x) \sqrt{1 + \sinh^2(x)},$$

but this involves the overhead of computing a square root. This suggests
using the tripling formula

$$\sinh(3x) = \sinh(x)\,(3 + 4 \sinh^2(x)). \tag{4.19}$$

However, it is usually more efficient to do argument reduction via the
doubling formula (4.18) for exp, because it takes one multiplication and one
squaring to apply the tripling formula, but only two squarings to apply the
doubling formula twice (and $3 < 2^2$). A drawback is loss of precision, caused
by cancellation in the computation of $\exp(x) - \exp(-x)$, when $|x|$ is small.
In this case it is better to use (see Exercise 4.10)

$$\sinh(x) = (\operatorname{expm1}(x) - \operatorname{expm1}(-x))/2. \tag{4.20}$$
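
As a quick double-precision check of (4.19) and (4.20), here is a sketch using only Python's standard `math` module. The function names are hypothetical, and the starting point for the tripling loop (formula (4.20) on the reduced argument) is an arbitrary choice for the demonstration.

```python
import math

def sinh_naive(x):
    """(exp(x) - exp(-x)) / 2: cancels badly for small |x|."""
    return (math.exp(x) - math.exp(-x)) / 2.0

def sinh_expm1(x):
    """Formula (4.20): no cancellation, since the two expm1 values have opposite signs."""
    return (math.expm1(x) - math.expm1(-x)) / 2.0

def sinh_tripled(x, k=3):
    """sinh(x) from sinh(x / 3^k) by k applications of the tripling formula (4.19)."""
    s = sinh_expm1(x / 3.0 ** k)
    for _ in range(k):
        s = s * (3.0 + 4.0 * s * s)
    return s

small = 1e-9
err_naive = abs(sinh_naive(small) - math.sinh(small)) / math.sinh(small)
err_expm1 = abs(sinh_expm1(small) - math.sinh(small)) / math.sinh(small)
```

For `small = 1e-9` the naive formula is wrong from about the eighth significant digit, while (4.20) agrees with `math.sinh` to within rounding.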

See §4.12 for further comments on doubling versus tripling, especially in the
FFT range.

4.4 Power Series

Once argument reduction has been applied, where possible (§4.3), one is
usually faced with the evaluation of a power series. The elementary and
special functions have power series expansions such as:

$$\exp x = \sum_{j \ge 0} \frac{x^j}{j!}, \qquad \ln(1+x) = \sum_{j \ge 0} \frac{(-1)^j x^{j+1}}{j+1},$$

$$\arctan x = \sum_{j \ge 0} \frac{(-1)^j x^{2j+1}}{2j+1}, \qquad \sinh x = \sum_{j \ge 0} \frac{x^{2j+1}}{(2j+1)!}, \quad \text{etc.}$$

This section discusses several techniques to recommend or to avoid. We use
the following notations: $x$ is the evaluation point, $n$ is the desired precision,
and $d$ is the number of terms retained in the power series, or $d - 1$ is the
degree of the corresponding polynomial $\sum_{0 \le j < d} a_j x^j$.

If $f(x)$ is analytic in a neighbourhood of some point $c$, an obvious method
to consider for the evaluation of $f(x)$ is summation of the Taylor series

$$f(x) = \sum_{j=0}^{d-1} (x - c)^j \frac{f^{(j)}(c)}{j!} + R_d(x, c).$$

As a simple but instructive example we consider the evaluation of $\exp(x)$
for $|x| \le 1$, using

$$\exp(x) = \sum_{j=0}^{d-1} \frac{x^j}{j!} + R_d(x), \tag{4.21}$$

where $|R_d(x)| \le |x|^d \exp(|x|)/d! \le e/d!$.

Using Stirling's approximation for $d!$, we see that $d \ge K(n) \sim n/\lg n$
is sufficient to ensure that $|R_d(x)| = O(2^{-n})$. Thus, the time required to
evaluate (4.21) with Horner's rule is $O(nM(n)/\log n)$.



Footnote: By Horner's rule (with argument $x$) we mean evaluating the polynomial
$s_0 = \sum_{0 \le j \le d} a_j x^j$ of degree $d$ (not $d - 1$ in this footnote) by the
recurrence $s_d = a_d$, $s_j = a_j + s_{j+1} x$ for $j = d-1, d-2, \ldots, 0$. Thus
$s_k = \sum_{k \le j \le d} a_j x^{j-k}$. An evaluation by Horner's rule takes $d$ additions
and $d$ multiplications, and is more efficient than explicitly evaluating the
individual terms $a_j x^j$.

In practice it is convenient to sum the series in the forward direction
($j = 0, 1, \ldots, d - 1$). The terms $t_j = x^j/j!$ and partial sums

$$S_j = \sum_{i=0}^{j} t_i$$

may be generated by the recurrence $t_j = x t_{j-1}/j$, $S_j = S_{j-1} + t_j$, and the
summation terminated when $|t_d| < 2^{-n}/e$. Thus, it is not necessary to
estimate $d$ in advance, as it would be if the series were summed by Horner's rule
in the backward direction ($j = d-1, d-2, \ldots, 0$) (see however Exercise 4.4).
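
The forward recurrence and its stopping rule translate almost line for line into code. The sketch below uses ordinary double-precision floats rather than multiple-precision arithmetic, so it only illustrates the control flow; the name `exp_forward` and the default target `n = 40` bits are assumptions.

```python
import math

def exp_forward(x, n=40):
    """Sum exp(x) = sum_j x^j / j! in the forward direction for |x| <= 1,
    stopping as soon as |t_d| < 2^(-n) / e. Returns (sum, d)."""
    if abs(x) > 1:
        raise ValueError("this sketch assumes |x| <= 1")
    bound = 2.0 ** (-n) / math.e
    t, s, j = 1.0, 1.0, 0            # t_0 = 1, S_0 = 1
    while abs(t) >= bound:
        j += 1
        t = x * t / j                # t_j = x * t_{j-1} / j
        s += t                       # S_j = S_{j-1} + t_j
    return s, j

approx, d = exp_forward(0.5)
```

No estimate of `d` is needed beforehand: the loop discovers it from the size of the current term.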

We now consider the effect of rounding errors, under the assumption that
floating-point operations are correctly rounded, i.e., satisfy

$$\circ(x \;\mathrm{op}\; y) = (x \;\mathrm{op}\; y)(1 + \delta),$$

where $|\delta| \le \varepsilon$ and "op" = "+", "-", "×" or "/". Here $\varepsilon = 2^{-n}$ is the
"machine precision" or "working precision". Let $\hat{t}_j$ be the computed value of
$t_j$, etc. Thus

$$|\hat{t}_j - t_j| \,/\, |t_j| \le 2j\varepsilon + O(\varepsilon^2)$$

and using $\sum_{j=0}^{d} t_j = S_d \le e$:

$$|\hat{S}_d - S_d| \le d e \varepsilon + \sum_{j=1}^{d} 2j\varepsilon |t_j| + O(\varepsilon^2)
\le (d + 2) e \varepsilon + O(\varepsilon^2) = O(n\varepsilon).$$

Thus, to get $|\hat{S}_d - S_d| = O(2^{-n})$, it is sufficient that $\varepsilon = O(2^{-n}/n)$. In
other words, we need to work with about $\lg n$ guard digits. This is not a
significant overhead if (as we assume) the number of digits may vary
dynamically. We can sum with $j$ increasing (the forward direction) or decreasing
(the backward direction). A slightly better error bound is obtainable for
summation in the backward direction, but this method has the disadvantage
that the number of terms $d$ has to be decided in advance (see however
Exercise 4.4).

In practice it is inefficient to keep the working precision $\varepsilon$ fixed. We can
profitably reduce it when computing $t_j$ from $t_{j-1}$ if $|t_{j-1}|$ is small, without
significantly increasing the error bound. We can also vary the working
precision when accumulating the sum, especially if it is computed in the backward
direction (so the smallest terms are summed first).

It is instructive to consider the effect of relaxing our restriction that
$|x| \le 1$. First suppose that $x$ is large and positive. Since $|t_j| > |t_{j-1}|$
when $j < |x|$, it is clear that the number of terms required in the sum (4.21)
is at least of order $|x|$. Thus, the method is slow for large $|x|$ (see §4.3 for
faster methods in this case).

If $|x|$ is large and $x$ is negative, the situation is even worse. From Stirling's
approximation we have

$$\max_{j \ge 0} |t_j| \sim \frac{\exp|x|}{\sqrt{2\pi|x|}},$$

but the result is $\exp(-|x|)$, so about $2|x|/\log 2$ guard digits are required to
compensate for what Lehmer called "catastrophic cancellation" [SI]. Since
$\exp(x) = 1/\exp(-x)$, this problem may easily be avoided, but the
corresponding problem is not always so easily avoided for other analytic functions.

Here is a less trivial example. To compute the error function

$$\operatorname{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-u^2}\, du,$$

we may use either the power series

$$\operatorname{erf}(x) = \frac{2}{\sqrt{\pi}} \sum_{j=0}^{\infty} \frac{(-1)^j x^{2j+1}}{j!\,(2j+1)} \tag{4.22}$$

or the (mathematically, but not numerically) equivalent

$$\operatorname{erf}(x) = \frac{2x e^{-x^2}}{\sqrt{\pi}} \sum_{j=0}^{\infty} \frac{2^j x^{2j}}{1 \cdot 3 \cdot 5 \cdots (2j+1)}. \tag{4.23}$$

For small $|x|$, the series (4.22) is slightly faster than the series (4.23)
because there is no need to compute an exponential. However, the
series (4.23) is preferable to (4.22) for moderate $|x|$ because it involves no
cancellation. For large $|x|$ neither series is satisfactory, because $\Omega(x^2)$ terms
are required, and in this case it is preferable to use the asymptotic
expansion for $\operatorname{erfc}(x) = 1 - \operatorname{erf}(x)$: see §4.5. In the borderline region use of the
continued fraction (4.40) could be considered: see Exercise 4.31.

In the following subsections we consider different methods to evaluate
power series. We generally ignore the effect of rounding errors, but the
results obtained above are typical.
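
The numerical difference between the two erf series (4.22) and (4.23) can be seen even in double precision. The following sketch is illustrative only; the function names and the fixed count of 400 terms are assumptions chosen so that both series have converged for the arguments tried.

```python
import math

def erf_series_422(x, terms=400):
    """erf(x) via the alternating series (4.22)."""
    total, p = 0.0, x                      # p = (-1)^j x^(2j+1) / j!
    for j in range(terms):
        total += p / (2 * j + 1)
        p *= -x * x / (j + 1)
    return 2.0 / math.sqrt(math.pi) * total

def erf_series_423(x, terms=400):
    """erf(x) via the positive-term series (4.23)."""
    total, t = 0.0, 1.0                    # t = 2^j x^(2j) / (1*3*...*(2j+1))
    for j in range(terms):
        total += t
        t *= 2.0 * x * x / (2 * j + 3)
    return 2.0 * x * math.exp(-x * x) / math.sqrt(math.pi) * total

err_422 = abs(erf_series_422(6.0) - math.erf(6.0))
err_423 = abs(erf_series_423(6.0) - math.erf(6.0))
```

At `x = 6` the alternating series loses most of its digits to cancellation, whereas (4.23) stays accurate; for small arguments such as `x = 0.5` the two agree.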

Assumption about the Coefficients 

We assume in this section that we have a power series $\sum_{j \ge 0} a_j x^j$ where
$a_{j+\delta}/a_j$ is a rational function $R(j)$ of $j$, and hence it is easy to evaluate
$a_0, a_1, a_2, \ldots$ sequentially. Here $\delta$ is a fixed positive constant, usually 1 or 2.
For example, in the case of $\exp x$, we have $\delta = 1$ and

$$\frac{a_{j+1}}{a_j} = \frac{j!}{(j+1)!} = \frac{1}{j+1}.$$

Our assumptions cover the common case of hypergeometric functions. For
the more general case of holonomic functions, see §4.9.2.

In common cases where our assumption is invalid, other good methods
are available to evaluate the function. For example, $\tan x$ does not satisfy our
assumption (the coefficients in its Taylor series are called tangent numbers
and are related to Bernoulli numbers, see §4.7.2), but to evaluate $\tan x$ we
can use Newton's method on the inverse function (arctan, which does satisfy
our assumptions, see §4.2.5), or we can use $\tan x = \sin x / \cos x$.

The Radius of Convergence 

If the elementary function is an entire function (e.g., exp, sin) then the power 
series converges in the whole complex plane. In this case the degree of the 
denominator of $R(j) = a_{j+1}/a_j$ is greater than that of the numerator.

In other cases (such as In, arctan) the function is not entire. The power 
series only converges in a disk because the function has a singularity on the 
boundary of this disk. In fact $\ln(x)$ has a singularity at the origin, which is
why we consider the power series for $\ln(1 + x)$. This power series has radius
of convergence 1.

Similarly, the power series for $\arctan(x)$ has radius of convergence 1
because $\arctan(x)$ has singularities on the unit circle (at $\pm i$) even though it is
uniformly bounded for all real $x$.

4.4.1 Direct Power Series Evaluation

Suppose that we want to evaluate a power series $\sum_{j \ge 0} a_j x^j$ at a given
argument $x$. Using periodicity (in the cases of sin, cos) and/or argument reduction
techniques (§4.3), we can often ensure that $|x|$ is sufficiently small. Thus, let
us assume that $|x| \le 1/2$ and that the radius of convergence of the series is
at least 1.

As above, assume that $a_{j+\delta}/a_j$ is a rational function of $j$, and hence easy
to evaluate. For simplicity we consider only the case $\delta = 1$. To sum the
series with error $O(2^{-n})$ it is sufficient to take $n + O(1)$ terms, so the time
required is $O(nM(n))$. If the function is entire, then the series converges
faster and the time is reduced to $O(nM(n)/\log n)$. However, we can do
much better by carrying the argument reduction further, as demonstrated in
the next section.

4.4.2 Power Series With Argument Reduction 

Consider the evaluation of $\exp(x)$. By applying argument reduction $k + O(1)$
times, we can ensure that the argument $x$ satisfies $|x| < 2^{-k}$. Then, to
obtain $n$-bit accuracy we only need to sum $O(n/k)$ terms of the power series.
Assuming that a step of argument reduction is $O(M(n))$, which is true for
the elementary functions, the total cost is $O((k + n/k)M(n))$. Indeed, the
argument reduction and/or reconstruction requires $O(k)$ steps of $O(M(n))$,
and the evaluation of the power series of order $n/k$ costs $(n/k)M(n)$; so
choosing $k \sim n^{1/2}$ gives cost

$$O(n^{1/2} M(n)).$$

For example, our comments apply to the evaluation of $\exp(x)$ using

$$\exp(x) = \exp(x/2)^2,$$

to $\operatorname{log1p}(x) = \ln(1 + x)$ using

$$\operatorname{log1p}(x) = 2 \operatorname{log1p}\left(\frac{x}{1 + \sqrt{1 + x}}\right),$$

and to $\arctan(x)$ using

$$\arctan x = 2 \arctan\left(\frac{x}{1 + \sqrt{1 + x^2}}\right).$$

Note that in the last two cases each step of the argument reduction requires
a square root, but this can be done with cost $O(M(n))$ by Newton's method
(§3.5). Thus in all three cases the overall cost is $O(n^{1/2} M(n))$, although
the implicit constant might be smaller for exp than for log1p or arctan. See
Exercises OHOl 

Using Symmetries 

A not-so-well-known idea is to evaluate $\ln(1 + x)$ using the power series

$$\ln\left(\frac{1 + y}{1 - y}\right) = 2 \sum_{j \ge 0} \frac{y^{2j+1}}{2j + 1}$$

with $y$ defined by $(1 + y)/(1 - y) = 1 + x$, i.e., $y = x/(2 + x)$. This saves
half the terms and also reduces the argument, since $y < x/2$ if $x > 0$.
Unfortunately this nice idea can be applied only once. For a related example,
see Exercise 4.11.
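
A short double-precision sketch of this symmetric series follows. The name `log1p_symmetric` and the fixed number of terms are assumptions for illustration; a multiple-precision version would choose the number of terms from the target precision.

```python
import math

def log1p_symmetric(x, terms=40):
    """ln(1 + x) = 2 * sum_{j>=0} y^(2j+1) / (2j+1), with y = x / (2 + x)."""
    y = x / (2.0 + x)
    y2 = y * y
    power, total = y, 0.0              # power = y^(2j+1)
    for j in range(terms):
        total += power / (2 * j + 1)
        power *= y2
    return 2.0 * total

approx = log1p_symmetric(0.5)
```

Only odd powers of $y$ appear, and $y = 0.2$ here replaces $x = 0.5$, which is why comparatively few terms are needed.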

4.4.3 Rectangular Series Splitting 

Once we determine how many terms in the power series are required for the 
desired accuracy, the problem reduces to evaluating a truncated power series, 
i.e., a polynomial. 

Let $P(x) = \sum_{0 \le j < d} a_j x^j$ be the polynomial that we want to evaluate,
$\deg(P) < d$. In the general case $x$ is a floating-point number of $n$ bits, and
we aim at an accuracy of $n$ bits for $P(x)$. However the coefficients $a_j$, or
their ratios $R(j) = a_{j+1}/a_j$, are usually small integers or rational numbers
of $O(\log n)$ bits. A scalar multiplication involves one coefficient $a_j$ and the
variable $x$ (or more generally an $n$-bit floating-point number), whereas a
nonscalar multiplication involves two powers of $x$ (or more generally two $n$-bit
floating-point numbers). Scalar multiplications are cheaper because the $a_j$
are small rationals of size $O(\log n)$, whereas $x$ and its powers generally have
$\Theta(n)$ bits. It is possible to evaluate $P(x)$ with $O(\sqrt{n})$ nonscalar
multiplications (plus $O(n)$ scalar multiplications and $O(n)$ additions, using $O(\sqrt{n})$
storage). The same idea applies, more generally, to evaluation of
hypergeometric functions.

Classical Splitting

Suppose $d = jk$, define $y = x^k$, and write

$$P(x) = \sum_{\ell=0}^{j-1} y^\ell P_\ell(x) \quad \text{where} \quad P_\ell(x) = \sum_{m=0}^{k-1} a_{k\ell+m} x^m.$$

One first computes the powers $x^2, x^3, \ldots, x^{k-1}, x^k = y$; then the polynomials
$P_\ell(x)$ are evaluated simply by multiplying $a_{k\ell+m}$ and the precomputed $x^m$ (it
is important not to use Horner's rule here, since this would involve expensive
nonscalar multiplications). Finally, $P(x)$ is computed from the $P_\ell(x)$ using
Horner's rule with argument $y$. To see the idea geometrically, write $P(x)$ as

$$\begin{array}{l}
y^0\,[a_0 + a_1 x + a_2 x^2 + \cdots + a_{k-1} x^{k-1}] \; + \\
y^1\,[a_k + a_{k+1} x + a_{k+2} x^2 + \cdots + a_{2k-1} x^{k-1}] \; + \\
y^2\,[a_{2k} + a_{2k+1} x + a_{2k+2} x^2 + \cdots + a_{3k-1} x^{k-1}] \; + \\
\quad \vdots \\
y^{j-1}\,[a_{(j-1)k} + a_{(j-1)k+1} x + a_{(j-1)k+2} x^2 + \cdots + a_{jk-1} x^{k-1}]
\end{array}$$

where $y = x^k$. The terms in square brackets are the polynomials $P_0(x)$,
$P_1(x)$, ..., $P_{j-1}(x)$.
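
The classical splitting scheme can be sketched as follows, using exact rationals so that the result can be compared with direct evaluation. This is an illustration under assumptions: the function name and the way nonscalar products are counted (every product of two powers of `x`, or of an accumulated value with `y`) are choices made here, not taken from the text.

```python
from fractions import Fraction

def classical_splitting(a, x, k):
    """Evaluate P(x) = sum a[i] * x^i, with len(a) = j*k, by classical splitting.
    Returns (value, number of nonscalar multiplications)."""
    d = len(a)
    if d % k != 0:
        raise ValueError("this sketch assumes d = j * k exactly")
    j = d // k
    nonscalar = 0
    powers = [1, x]                               # powers[m] = x^m
    for m in range(2, k + 1):
        if m % 2 == 0:                            # even power: square x^(m/2)
            powers.append(powers[m // 2] * powers[m // 2])
        else:
            powers.append(powers[m - 1] * x)
        nonscalar += 1
    y = powers[k]
    # Each block P_l(x) needs only scalar products a[k*l + m] * x^m.
    blocks = [sum(a[k * l + m] * powers[m] for m in range(k)) for l in range(j)]
    result = blocks[j - 1]                        # Horner's rule in y over the blocks
    for l in range(j - 2, -1, -1):
        result = result * y + blocks[l]
        nonscalar += 1
    return result, nonscalar

coeffs = [Fraction(1, i + 1) for i in range(12)]  # arbitrary small rational coefficients
value, count = classical_splitting(coeffs, Fraction(1, 3), k=4)
```

With `d = 12` and `k = 4` the counter reports five nonscalar products: three for the powers of `x` and two for Horner's rule in `y`.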

As an example, consider $d = 12$, with $j = 3$ and $k = 4$. This gives
$P_0(x) = a_0 + a_1 x + a_2 x^2 + a_3 x^3$, $P_1(x) = a_4 + a_5 x + a_6 x^2 + a_7 x^3$,
$P_2(x) = a_8 + a_9 x + a_{10} x^2 + a_{11} x^3$, then $P(x) = P_0(x) + y P_1(x) + y^2 P_2(x)$, where
$y = x^4$. Here we need to compute $x^2, x^3, x^4$, which requires three nonscalar
products (note that even powers like $x^4$ should be computed as $(x^2)^2$ to
use squarings instead of multiplies), and we need two nonscalar products to
evaluate $P(x)$, thus a total of five nonscalar products, instead of $d - 2 = 10$
with a naive application of Horner's rule to $P(x)$.

Footnote: $P(x)$ has degree $d - 1$, so Horner's rule performs $d - 1$ products, but the first one
$x \times a_{d-1}$ is a scalar product, hence there are $d - 2$ nonscalar products.

Modular Splitting

An alternate splitting is the following, which may be obtained by transposing
the matrix of coefficients above, swapping $j$ and $k$, and interchanging the
powers of $x$ and $y$. It might also be viewed as a generalized odd-even scheme
(§1.3.5). Suppose as before that $d = jk$, and write, with $y = x^j$:

$$P(x) = \sum_{\ell=0}^{j-1} x^\ell P_\ell(y) \quad \text{where} \quad P_\ell(y) = \sum_{m=0}^{k-1} a_{jm+\ell}\, y^m.$$

First compute $y = x^j, y^2, y^3, \ldots, y^{k-1}$. Now the polynomials $P_\ell(y)$ can be
evaluated using only scalar multiplications of the form $a_{jm+\ell} \times y^m$.
To see the idea geometrically, write $P(x)$ as

$$\begin{array}{l}
x^0\,[a_0 + a_j y + a_{2j} y^2 + \cdots] \; + \\
x^1\,[a_1 + a_{j+1} y + a_{2j+1} y^2 + \cdots] \; + \\
x^2\,[a_2 + a_{j+2} y + a_{2j+2} y^2 + \cdots] \; + \\
\quad \vdots \\
x^{j-1}\,[a_{j-1} + a_{2j-1} y + a_{3j-1} y^2 + \cdots]
\end{array}$$

where $y = x^j$. We traverse the first row of the array, then the second row,
then the third, ..., finally the $j$-th row, accumulating sums $S_0, S_1, \ldots, S_{j-1}$
(one for each row). At the end of this process $S_\ell = P_\ell(y)$ and we only have
to evaluate

$$P(x) = \sum_{\ell=0}^{j-1} x^\ell S_\ell.$$

The complexity of each scheme is almost the same (see Exercise 4.12). With
$d = 12$ ($j = 3$ and $k = 4$) we have $P_0(y) = a_0 + a_3 y + a_6 y^2 + a_9 y^3$,
$P_1(y) = a_1 + a_4 y + a_7 y^2 + a_{10} y^3$, $P_2(y) = a_2 + a_5 y + a_8 y^2 + a_{11} y^3$. We first compute
$y = x^3$, $y^2$ and $y^3$, then we evaluate $P_0(y)$ in three scalar multiplications $a_3 y$,
$a_6 y^2$, and $a_9 y^3$ and three additions, similarly for $P_1$ and $P_2$, and finally we
evaluate $P(x)$ using

$$P(x) = P_0(y) + x P_1(y) + x^2 P_2(y),$$

(here we might use Horner's rule). In this example, we have a total of six
nonscalar multiplications: four to compute $y$ and its powers, and two to
evaluate $P(x)$.
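
Modular splitting admits an equally short sketch, again with exact rationals and a counter for nonscalar products. The function name `modular_splitting` and the counting convention are assumptions for illustration.

```python
from fractions import Fraction

def modular_splitting(a, x, j):
    """Evaluate P(x) = sum a[i] * x^i, with len(a) = j*k, by modular splitting.
    Returns (value, number of nonscalar multiplications)."""
    d = len(a)
    if d % j != 0:
        raise ValueError("this sketch assumes d = j * k exactly")
    k = d // j
    nonscalar = 0
    xpow = [1, x]                                 # x^0 .. x^j, built up to y = x^j
    for _ in range(2, j + 1):
        xpow.append(xpow[-1] * x)
        nonscalar += 1
    y = xpow[j]
    ypow = [1, y]                                 # y^0 .. y^(k-1)
    for _ in range(2, k):
        ypow.append(ypow[-1] * y)
        nonscalar += 1
    # Row sums S_l = P_l(y) use only scalar products a[j*m + l] * y^m.
    sums = [sum(a[j * m + l] * ypow[m] for m in range(k)) for l in range(j)]
    result = sums[j - 1]                          # Horner's rule in x over the rows
    for l in range(j - 2, -1, -1):
        result = result * x + sums[l]
        nonscalar += 1
    return result, nonscalar

coeffs = [Fraction(1, i + 1) for i in range(12)]  # arbitrary small rational coefficients
value, count = modular_splitting(coeffs, Fraction(1, 3), j=3)
```

For `d = 12` and `j = 3` the counter gives six nonscalar products, matching the count in the worked example.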



Complexity of Rectangular Series Splitting 



To evaluate a polynomial $P(x)$ of degree $d - 1 = jk - 1$, rectangular series
splitting takes $O(j + k)$ nonscalar multiplications (each costing $O(M(n))$)
and $O(jk)$ scalar multiplications. The scalar multiplications involve
multiplication and/or division of a multiple-precision number by small integers.
Assume that these multiplications and/or divisions take time $c(d)n$ each (see
Exercise 4.13 for a justification of this assumption). The function $c(d)$
accounts for the fact that the involved scalars (the coefficients $a_j$ or the ratios
$a_{j+1}/a_j$) have a size depending on the degree $d$ of $P(x)$. In practice we can
usually regard $c(d)$ as constant.

Choosing $j \sim k \sim d^{1/2}$ we get overall time

$$O(d^{1/2} M(n) + dn \cdot c(d)). \tag{4.24}$$

If $d$ is of the same order as the precision $n$ of $x$, this is not an improvement
on the bound $O(n^{1/2} M(n))$ that we obtained already by argument reduction
and power series evaluation (§4.4.2). However, we can do argument reduction
before applying rectangular series splitting. Assuming that $c(n) = O(1)$ (see
Exercise 4.14 for a detailed analysis), the total complexity is:

$$T(n) = O\left(\frac{n}{d} M(n) + d^{1/2} M(n) + dn\right),$$

where the extra $(n/d)M(n)$ term comes from argument reduction and/or
reconstruction. Which term dominates? There are two cases:

1. $M(n) \gg n^{4/3}$. Here the minimum is obtained when the first two terms
(argument reduction/reconstruction and nonscalar multiplications)
are equal, i.e., for $d \sim n^{2/3}$, which yields $T(n) = O(n^{1/3} M(n))$. This
case applies if we use classical or Karatsuba multiplication, since $\lg 3 > 4/3$,
and similarly for Toom-Cook 3-, 4-, 5-, or 6-way multiplication
(but not 7-way, since $\log_7 13 < 4/3$). In this case $T(n) \gg n^{5/3}$.

2. $M(n) \ll n^{4/3}$. Here the minimum is obtained when the first and the last
terms (argument reduction/reconstruction and scalar multiplications)
are equal. The optimal value of $d$ is then $\sqrt{M(n)}$, and we get
an improved bound $\Theta(n\sqrt{M(n)}) \gg n^{3/2}$. We can not approach the
$O(n^{1+\varepsilon})$ that is achievable with AGM-based methods (if applicable),
see §4.8.

4.5 Asymptotic Expansions

Often it is necessary to use different methods to evaluate a special function
in different parts of its domain. For example, the exponential integral

$$E_1(x) = \int_x^\infty \frac{\exp(-u)}{u}\, du \tag{4.25}$$

is defined for all $x > 0$. However, the power series

$$E_1(x) + \gamma + \ln x = \sum_{j=1}^{\infty} \frac{x^j (-1)^{j-1}}{j!\, j} \tag{4.26}$$

is unsatisfactory as a means of evaluating $E_1(x)$ for large positive $x$, for the
reasons discussed in §4.4 in connection with the power series (4.22) for $\operatorname{erf}(x)$,
or the power series for $\exp(x)$ ($x$ negative). For sufficiently large positive $x$
it is preferable to use

$$e^x E_1(x) = \sum_{j=1}^{k} \frac{(j-1)!\,(-1)^{j-1}}{x^j} + R_k(x), \tag{4.27}$$

where

$$R_k(x) = k!\,(-1)^k \exp(x) \int_x^\infty \frac{\exp(-u)}{u^{k+1}}\, du. \tag{4.28}$$

Note that

$$|R_k(x)| < \frac{k!}{x^{k+1}},$$

so

$$\lim_{x \to +\infty} R_k(x) = 0,$$

but $\lim_{k \to \infty} R_k(x)$ does not exist. In other words, the series

$$\sum_{j=1}^{\infty} \frac{(j-1)!\,(-1)^{j-1}}{x^j}$$

is divergent. In such cases we call this an asymptotic series and write

$$e^x E_1(x) \sim \sum_{j > 0} \frac{(j-1)!\,(-1)^{j-1}}{x^j}. \tag{4.29}$$

Footnote: The functions $E_1(x)$ and $\operatorname{Ei}(x) = \mathrm{PV} \int_{-\infty}^{x} (\exp(t)/t)\, dt$ are both called
"exponential integrals". Closely related is the "logarithmic integral"
$\operatorname{li}(x) = \operatorname{Ei}(\ln x) = \mathrm{PV} \int_0^x (1/\ln t)\, dt$.
Here the integrals $\mathrm{PV} \int \cdots$ should be interpreted as Cauchy principal values if there is a
singularity in the range of integration. The power series (4.26) is valid for $x \in \mathbb{C}$ if
$|\arg x| < \pi$ (see Exercise 4.16).

Although they do not generally converge, asymptotic series are very useful.
Often (though not always!) the error is bounded by the last term taken in the
series (or by the first term omitted). Also, when the terms in the asymptotic
series alternate in sign, it can often be shown that the true value lies between
two consecutive approximations obtained by summing the series with (say)
$k$ and $k + 1$ terms. For example, this is true for the series (4.29) above,
provided $x$ is real and positive.

When $x$ is large and positive, the relative error attainable by using (4.27)
with $k = \lfloor x \rfloor$ is $O(x^{1/2} \exp(-x))$, because

$$|R_k(k)| \le k!/k^{k+1} = O(k^{-1/2} \exp(-k)) \tag{4.30}$$

and the leading term on the right side of (4.27) is $1/x$. Thus, the
asymptotic series may be used to evaluate $E_1(x)$ to precision $n$ whenever
$x > n \ln 2 + O(\ln n)$. More precise estimates can be obtained by using a version
of Stirling's approximation with error bounds, for example

$$\left(\frac{k}{e}\right)^k \sqrt{2\pi k} < k! < \left(\frac{k}{e}\right)^k \sqrt{2\pi k}\, \exp\left(\frac{1}{12k}\right).$$

If $x$ is too small for the asymptotic approximation to be sufficiently accurate,
we can avoid the problem of cancellation in the power series (4.26) by the
technique of Exercise 4.19. However, the asymptotic approximation is faster
and hence is preferable whenever it is sufficiently accurate.
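
The bracketing property of the alternating asymptotic series can be checked numerically. The sketch below is an illustration in double precision; the function names are hypothetical, the partial sum is written as $\exp(-x)$ times the sum over $j$ of $(j-1)!\,(-1)^{j-1}/x^j$, and the reference value comes from the convergent power series (4.26), which at $x = 10$ still retains enough digits in doubles for this comparison.

```python
import math

EULER_GAMMA = 0.5772156649015329   # Euler's constant, rounded to double precision

def e1_power_series(x, terms=200):
    """E1(x) from the convergent power series (4.26)."""
    total, p = 0.0, -1.0               # p becomes (-1)^(j-1) x^j / j! after each update
    for j in range(1, terms + 1):
        p *= -x / j
        total += p / j
    return -EULER_GAMMA - math.log(x) + total

def e1_asymptotic(x, k):
    """exp(-x) * sum_{j=1..k} (j-1)! (-1)^(j-1) / x^j, the k-term asymptotic sum."""
    total, term = 0.0, 1.0 / x         # term for j = 1
    for j in range(1, k + 1):
        total += term
        term *= -j / x                 # next term: j! (-1)^j / x^(j+1)
    return math.exp(-x) * total

x = 10.0
reference = e1_power_series(x)
low, high = sorted((e1_asymptotic(x, 10), e1_asymptotic(x, 11)))
```

With `k = 10` and `k = 11` terms the two partial sums enclose the reference value, as the text states for real positive $x$.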

Examples where asymptotic expansions are useful include the evaluation
of $\operatorname{erfc}(x)$, $\Gamma(x)$, Bessel functions, etc. We discuss some of these below.

Asymptotic expansions often arise when the convergence of series is
accelerated by the Euler-Maclaurin sum formula. For example, Euler's constant
$\gamma$ is defined by

$$\gamma = \lim_{N \to \infty} (H_N - \ln N), \tag{4.31}$$

where $H_N = \sum_{1 \le j \le N} 1/j$ is a harmonic number. However, Eqn. (4.31)
converges slowly, so to evaluate $\gamma$ accurately we need to accelerate the
convergence. This can be done using the Euler-Maclaurin formula. The idea is to
split the sum $H_N$ into two parts:

$$H_N = H_{p-1} + \sum_{j=p}^{N} \frac{1}{j}.$$

We approximate the second sum using the Euler-Maclaurin formula with
$a = p$, $b = N$, $f(x) = 1/x$, then let $N \to +\infty$. The result is

$$\gamma \sim H_p - \ln p - \frac{1}{2p} + \sum_{k \ge 1} \frac{B_{2k}}{2k}\, p^{-2k}. \tag{4.32}$$

If $p$ and the number of terms in the asymptotic expansion are chosen
judiciously, this gives a good algorithm for computing $\gamma$ (though not the best
algorithm: see §4.12 for a faster algorithm that uses properties of Bessel
functions).

Footnote: The Euler-Maclaurin sum formula is a way of expressing the difference between a sum
and an integral as an asymptotic expansion. For example, assuming that $a \in \mathbb{Z}$, $b \in \mathbb{Z}$,
$a \le b$, and $f(x)$ satisfies certain conditions, one form of the formula is

$$\sum_{a \le k \le b} f(k) - \int_a^b f(x)\, dx \sim \frac{f(a) + f(b)}{2} + \sum_{k \ge 1} \frac{B_{2k}}{(2k)!} \left(f^{(2k-1)}(b) - f^{(2k-1)}(a)\right).$$

Often we can let $b \to +\infty$ and omit the terms involving $b$ on the right-hand-side. For
more information see §4.12.
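
A small experiment shows how quickly the accelerated formula for Euler's constant converges. The sketch assumes the form $\gamma \sim H_p - \ln p - 1/(2p) + \sum_{k \ge 1} B_{2k}\, p^{-2k}/(2k)$ and uses a hard-coded table of the first three Bernoulli numbers; the function name and default `p = 10` are arbitrary choices for illustration.

```python
from fractions import Fraction
import math

BERNOULLI_EVEN = [Fraction(1, 6), Fraction(-1, 30), Fraction(1, 42)]   # B_2, B_4, B_6

def euler_gamma(p=10, bernoulli=BERNOULLI_EVEN):
    """Approximate Euler's constant from H_p, ln p and a few Euler-Maclaurin terms."""
    h_p = sum(Fraction(1, j) for j in range(1, p + 1))                 # harmonic number H_p
    correction = sum(b / (2 * k * p ** (2 * k))
                     for k, b in enumerate(bernoulli, start=1))
    return float(h_p - Fraction(1, 2 * p) + correction) - math.log(p)

approx = euler_gamma()
```

With `p = 10` and three correction terms the result already agrees with $\gamma \approx 0.5772156649$ to about ten decimal places, whereas $H_{10} - \ln 10$ alone is wrong in the second decimal.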

Here is another example. The Riemann zeta-function $\zeta(s)$ is defined for
$s \in \mathbb{C}$, $\Re(s) > 1$, by

$$\zeta(s) = \sum_{j=1}^{\infty} j^{-s}, \tag{4.33}$$

and by analytic continuation for other $s \ne 1$. $\zeta(s)$ may be evaluated to any
desired precision if $m$ and $p$ are chosen large enough in the Euler-Maclaurin
formula

$$\zeta(s) = \sum_{j=1}^{p-1} j^{-s} + \frac{p^{-s}}{2} + \frac{p^{1-s}}{s-1} + \sum_{k=1}^{m} T_{k,p}(s) + E_{m,p}(s), \tag{4.34}$$

where

$$T_{k,p}(s) = \frac{B_{2k}}{(2k)!}\, p^{1-s-2k} \prod_{j=0}^{2k-2} (s + j), \tag{4.35}$$

$$|E_{m,p}(s)| < |T_{m+1,p}(s)\,(s + 2m + 1)/(\sigma + 2m + 1)|, \tag{4.36}$$

$m \ge 0$, $p \ge 1$, $\sigma = \Re(s) > -(2m + 1)$, and the $B_{2k}$ are Bernoulli numbers.

In arbitrary-precision computations we must be able to compute as many
terms of an asymptotic expansion as are required to give the desired accuracy.
It is easy to see that, if $m$ in (4.34) is bounded as the precision $n$ goes to
$\infty$, then $p$ has to increase as an exponential function of $n$. To evaluate $\zeta(s)$
from (4.34) to precision $n$ in time polynomial in $n$, both $m$ and $p$ must tend to
infinity with $n$. Thus, the Bernoulli numbers $B_2, \ldots, B_{2m}$ can not be stored in
a table of fixed size but must be computed when needed (see §4.7). For this
reason we can not use asymptotic expansions when the general form of the
coefficients is unknown or the coefficients are too difficult to evaluate. Often
there is a related expansion with known and relatively simple coefficients.
For example, the asymptotic expansion (4.38) for $\ln \Gamma(x)$ has coefficients
related to the Bernoulli numbers, like the expansion (4.34) for $\zeta(s)$, and thus
is simpler to implement than Stirling's asymptotic expansion for $\Gamma(x)$ (see
Exercise 102]). 
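
To make the Euler-Maclaurin evaluation of the zeta function concrete, here is a double-precision sketch. It assumes the standard form: the sum of $j^{-s}$ for $j < p$, plus $p^{-s}/2$, plus $p^{1-s}/(s-1)$, plus $m$ correction terms $B_{2k}/(2k)! \cdot p^{1-s-2k} \prod_{j=0}^{2k-2}(s+j)$. The function name and the fixed table of five Bernoulli numbers are assumptions; as the text explains, an arbitrary-precision version would have to generate them on demand.

```python
import math
from fractions import Fraction

# B_2, B_4, ..., B_10: a fixed table, so this sketch only supports m <= 5
BERNOULLI_EVEN = [Fraction(1, 6), Fraction(-1, 30), Fraction(1, 42),
                  Fraction(-1, 30), Fraction(5, 66)]

def zeta_euler_maclaurin(s, p=10, m=5):
    """Approximate zeta(s) for real s > 1 with p - 1 direct terms and m corrections."""
    s = float(s)
    total = sum(j ** -s for j in range(1, p))           # sum_{j=1}^{p-1} j^(-s)
    total += p ** -s / 2 + p ** (1 - s) / (s - 1)
    for k in range(1, m + 1):
        prod = 1.0
        for j in range(2 * k - 1):                      # j = 0, ..., 2k - 2
            prod *= s + j
        total += (float(BERNOULLI_EVEN[k - 1]) / math.factorial(2 * k)
                  * p ** (1 - s - 2 * k) * prod)
    return total

zeta2 = zeta_euler_maclaurin(2)
```

With only nine direct terms and five corrections the value of `zeta2` matches $\pi^2/6$ to roughly thirteen digits, illustrating why both `m` and `p` should grow with the target precision.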

Consider the computation of the error function $\operatorname{erf}(x)$. As seen in §4.4, the
series (4.22) and (4.23) are not satisfactory for large $|x|$, since they require
$\Omega(x^2)$ terms. For example, to evaluate $\operatorname{erf}(1000)$ with an accuracy of six
digits, Eqn. (4.22) requires at least 2 718 279 terms! Instead, we may use an
asymptotic expansion. The complementary error function $\operatorname{erfc}(x) = 1 - \operatorname{erf}(x)$
satisfies

$$\operatorname{erfc}(x) \sim \frac{\exp(-x^2)}{x\sqrt{\pi}} \sum_{j=0}^{k} (-1)^j \frac{(2j)!}{j!} (2x)^{-2j}, \tag{4.37}$$

with the error bounded in absolute value by the next term and of the same
sign. In the case $x = 1000$, the term for $j = 1$ of the sum equals $-0.5 \times 10^{-6}$;
thus $\exp(-x^2)/(x\sqrt{\pi})$ is an approximation to $\operatorname{erfc}(x)$ with an accuracy of six digits.
Because $\operatorname{erfc}(1000) \approx 1.86 \times 10^{-434298}$ is very small, this gives an extremely
accurate approximation to $\operatorname{erf}(1000)$.

For a function like the error function where both a power series (at $x = 0$)
and an asymptotic expansion (at $x = \infty$) are available, we might prefer to
use the former or the latter, depending on the value of the argument and
on the desired precision. We study here in some detail the case of the error
function, since it is typical.

Footnote: In addition, we would have to store the Bernoulli numbers as exact rationals,
taking $\sim m^2 \lg m$ bits of storage, since a floating-point representation would not be
convenient unless the target precision $n$ were known in advance. See §4.7.2 and Exercise 4.37.

The sum in (4.37) is divergent, since its $j$-th term is $\sim \sqrt{2}\,(j/(e x^2))^j$. We
need to show that the smallest term is $O(2^{-n})$ in order to be able to deduce
an $n$-bit approximation to $\operatorname{erfc}(x)$. The terms decrease while $j < x^2 + 1/2$,
so the minimum is obtained for $j \approx x^2$, and is of order $e^{-x^2}$, thus we need
$x > \sqrt{n \ln 2}$. For example, for $n = 10^6$ bits this yields $x > 833$. However,
since $\operatorname{erfc}(x)$ is small for large $x$, say $\operatorname{erfc}(x) \approx 2^{-\lambda}$, we need only $m = n - \lambda$
correct bits of $\operatorname{erfc}(x)$ to get $n$ correct bits of $\operatorname{erf}(x) = 1 - \operatorname{erfc}(x)$.

Consider $x$ fixed and $j$ varying in the terms in the sums (4.22) and (4.37).
For $j < x^2$, $x^{2j}/j!$ is an increasing function of $j$, but $(2j)!/(j!\,(4x^2)^j)$ is a
decreasing function of $j$. In this region the terms in Eqn. (4.37) are decreasing.
Thus, comparing the series (4.22) and (4.37), we see that the latter should
always be used if it can give sufficient accuracy. Similarly, (4.37) should if
possible be used in preference to (4.23), as the magnitudes of corresponding
terms in (4.22) and in (4.23) are similar.



Algorithm 4.2 Erf

Input: positive floating-point number $x$, integer $n$
Output: an $n$-bit approximation to $\operatorname{erf}(x)$
$m \leftarrow \lceil n - (x^2 + \ln x + (\ln \pi)/2)/(\ln 2) \rceil$
if $(m + 1/2) \ln(2) < x^2$ then
    $t \leftarrow \operatorname{erfc}(x)$ with the asymptotic expansion (4.37) and precision $m$
    return $1 - t$ (in precision $n$)
else if $x < 1$ then
    compute $\operatorname{erf}(x)$ with the power series (4.22) in precision $n$
else
    compute $\operatorname{erf}(x)$ with the power series (4.23) in precision $n$.



Algorithm Erf computes $\operatorname{erf}(x)$ for real positive $x$ (for other real $x$, use
the fact that $\operatorname{erf}(x)$ is an odd function, so $\operatorname{erf}(-x) = -\operatorname{erf}(x)$ and $\operatorname{erf}(0) = 0$).
In Algorithm Erf, the number of terms needed if Eqn. (4.22) or Eqn. (4.23)
is used is approximately the unique positive root $j_0$ (rounded up to the next
integer) of

$$j(\ln j - 2 \ln x - 1) = n \ln 2,$$

so $j_0 > e x^2$. On the other hand, if Eqn. (4.37) is used, then the number
of terms $k < x^2 + 1/2$ (since otherwise the terms start increasing). The
condition $(m + 1/2) \ln(2) < x^2$ in the algorithm ensures that the asymptotic
expansion can give $m$-bit accuracy.

Here is an example: for $x = 800$ and a precision of one million bits,
Equation (4.23) requires about $j_0 = 2\,339\,601$ terms. Eqn. (4.37) tells us
that $\operatorname{erfc}(x) \approx 2^{-923\,335}$; thus we need only $m = 76\,665$ bits of precision for
$\operatorname{erfc}(x)$; in this case Eqn. (4.37) requires only about $k = 10\,375$ terms. Note
that using Eqn. (4.22) would be slower than using Eqn. (4.23), because we
would have to compute about the same number of terms, but with higher
precision, to compensate for cancellation. We recommend using Eqn. (4.22)
only if $|x|$ is small enough that any cancellation is insignificant (for example,
if $|x| < 1$).

Another example, closer to the boundary: for $x = 589$, still with $n = 10^6$,
we have $m = 499\,489$, which gives $j_0 = 1\,497\,924$, and $k = 325\,092$. For
somewhat smaller $x$ (or larger $n$) it might be desirable to use the continued
fraction (4.40), see Exercise 4.31.
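
The branch selection of Algorithm Erf can be mimicked in ordinary double precision. The sketch below is not a multiple-precision implementation: the name `erf_sketch`, the default `n = 50`, and the stopping tolerances are assumptions, and the "precision m" of the asymptotic branch is only imitated by stopping the sum once the terms fall below about $2^{-m}$.

```python
import math

def erf_sketch(x, n=50):
    """Double-precision imitation of the three branches of Algorithm Erf (x > 0)."""
    if x <= 0:
        raise ValueError("x must be positive")
    x2 = x * x
    m = math.ceil(n - (x2 + math.log(x) + math.log(math.pi) / 2) / math.log(2))
    if (m + 0.5) * math.log(2) < x2:
        # Asymptotic expansion of erfc: term_j = (-1)^j (2j)! / (j! (2x)^(2j)).
        tol = 2.0 ** -max(m + 2, 1)
        total, term, j = 0.0, 1.0, 0
        while True:
            total += term
            j += 1
            term *= -(2 * j - 1) / (2 * x2)
            if abs(term) < tol or j > x2:      # accurate enough, or terms about to grow
                break
        return 1.0 - math.exp(-x2) / (x * math.sqrt(math.pi)) * total
    tiny = 2.0 ** -(n + 3)
    if x < 1:
        # Alternating power series (4.22): p = (-1)^j x^(2j+1) / j!.
        total, p, j = 0.0, x, 0
        while True:
            total += p / (2 * j + 1)
            j += 1
            p *= -x2 / j
            if abs(p) < tiny * total:
                break
        return 2.0 / math.sqrt(math.pi) * total
    # Positive-term power series (4.23): t = 2^j x^(2j) / (1*3*...*(2j+1)).
    total, t, j = 0.0, 1.0, 0
    while True:
        total += t
        j += 1
        t *= 2 * x2 / (2 * j + 1)
        if t < tiny * total:
            break
    return 2.0 * x * math.exp(-x2) / math.sqrt(math.pi) * total
```

For `n = 50` the asymptotic branch takes over at roughly `x > 4.2`; below that, the series (4.22) is used for `x < 1` and (4.23) otherwise.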

Occasionally an asymptotic expansion can be used to obtain arbitrarily
high precision. For example, consider the computation of $\ln \Gamma(x)$. For large
positive $x$, we can use Stirling's asymptotic expansion

$$\ln \Gamma(x) = \left(x - \frac{1}{2}\right) \ln x - x + \frac{\ln(2\pi)}{2} + \sum_{k=1}^{m-1} \frac{B_{2k}}{2k(2k-1)\,x^{2k-1}} + R_m(x), \tag{4.38}$$

where $R_m(x)$ is less in absolute value than the first term neglected, that is

$$\frac{B_{2m}}{2m(2m-1)\,x^{2m-1}},$$

and has the same sign. The ratio of successive terms $t_k$ and $t_{k+1}$ of the sum
is

$$\frac{t_{k+1}}{t_k} \approx -\left(\frac{k}{\pi x}\right)^2,$$

so the terms start to increase in absolute value for (approximately) $k > \pi x$.
This gives a bound on the accuracy attainable, in fact

$$\ln |R_m(x)| > -2\pi x \ln(x) + O(x).$$

However, because $\Gamma(x)$ satisfies the functional equation $\Gamma(x + 1) = x\Gamma(x)$, we
can take $x' = x + \delta$ for some sufficiently large $\delta \in \mathbb{N}$, evaluate $\ln \Gamma(x')$ using
the asymptotic expansion, and then compute $\ln \Gamma(x)$ from the functional
equation. See Exercise 4.21.
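
The shift-then-expand strategy can be sketched in double precision as follows. It assumes the Stirling expansion in the form $(x - 1/2)\ln x - x + \ln(2\pi)/2 + \sum_k B_{2k}/(2k(2k-1)x^{2k-1})$; the function name, the shift of 10 and the five tabulated Bernoulli numbers are assumptions made for the example.

```python
import math
from fractions import Fraction

# B_2, B_4, ..., B_10: enough for the terms k = 1..5 used below
BERNOULLI_EVEN = [Fraction(1, 6), Fraction(-1, 30), Fraction(1, 42),
                  Fraction(-1, 30), Fraction(5, 66)]

def lgamma_stirling(x, shift=10, m=6):
    """ln Gamma(x) for x > 0: evaluate Stirling's expansion at x + shift with the
    terms k = 1..m-1, then undo the shift using Gamma(x + 1) = x * Gamma(x)."""
    xs = x + shift
    value = (xs - 0.5) * math.log(xs) - xs + 0.5 * math.log(2 * math.pi)
    for k in range(1, m):
        value += float(BERNOULLI_EVEN[k - 1]) / (2 * k * (2 * k - 1) * xs ** (2 * k - 1))
    # ln Gamma(x) = ln Gamma(x + shift) - sum_{i=0}^{shift-1} ln(x + i)
    return value - sum(math.log(x + i) for i in range(shift))

approx = lgamma_stirling(0.5)
```

Without the shift the expansion would be useless at `x = 0.5`; after moving the argument to `10.5` five correction terms already give about thirteen correct digits.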



* The asymptotic expansion is also valid for x ∈ C, |arg x| < π, x ≠ 0, but the bound
on the error term $R_m(x)$ in this case is more complicated. See for example [1, 6.1.42].

4.6 Continued Fractions

In §4.5 we considered the exponential integral $E_1(x)$. This can be computed
using the continued fraction

\[ e^x E_1(x) = \cfrac{1}{x + \cfrac{1}{1 + \cfrac{1}{x + \cfrac{2}{1 + \cfrac{2}{x + \cfrac{3}{1 + \cdots}}}}}}. \]

Writing continued fractions in this way takes a lot of space, so instead we
use the shorthand notation

\[ e^x E_1(x) = \frac{1}{x+}\;\frac{1}{1+}\;\frac{1}{x+}\;\frac{2}{1+}\;\frac{2}{x+}\;\frac{3}{1+}\;\cdots. \tag{4.39} \]



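Another example is

\[ \operatorname{erfc}(x) = \left(\frac{e^{-x^2}}{\sqrt{\pi}}\right)
   \frac{1}{x+}\;\frac{1/2}{x+}\;\frac{2/2}{x+}\;\frac{3/2}{x+}\;\frac{4/2}{x+}\;\frac{5/2}{x+}\;\cdots. \tag{4.40} \]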

Formally, a continued fraction

\[ f = \frac{a_1}{b_1+}\;\frac{a_2}{b_2+}\;\frac{a_3}{b_3+}\;\cdots \in \widehat{\mathbb{C}} \]

is defined by two sequences $(a_j)_{j \in \mathbb{N}^*}$ and $(b_j)_{j \in \mathbb{N}^*}$, where $a_j, b_j \in \mathbb{C}$. Here
$\widehat{\mathbb{C}} = \mathbb{C} \cup \{\infty\}$ is the set of extended complex numbers.* The expression f is
defined to be $\lim_{k \to \infty} f_k$, if the limit exists, where

\[ f_k = \frac{a_1}{b_1+}\;\frac{a_2}{b_2+}\;\frac{a_3}{b_3+}\;\cdots\;\frac{a_k}{b_k} \tag{4.41} \]

is the finite continued fraction, called the k-th approximant, obtained by
truncating the infinite continued fraction after k quotients.

* Arithmetic operations on $\mathbb{C}$ are extended to $\widehat{\mathbb{C}}$ in the obvious way, for example
1/0 = 1 + ∞ = 1 × ∞ = ∞, 1/∞ = 0. Note that 0/0, 0 × ∞ and ∞ ± ∞ are undefined.

Sometimes continued fractions are preferable, for computational purposes,
to power series or asymptotic expansions. For example, Euler's continued
fraction (4.39) converges for all real x > 0, and is better for computation
of $E_1(x)$ than the power series (4.26) in the region where the power series
suffers from catastrophic cancellation but the asymptotic expansion (4.27) is
not sufficiently accurate. Convergence of (4.39) is slow if x is small, so (4.39)
is preferred for precision n evaluation of $E_1(x)$ only when x is in a certain
interval, say $x \in (c_1 n, c_2 n)$, $c_1 \approx 0.1$, $c_2 = \ln 2 \approx 0.6931$ (see Exercise 4.24).

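Continued fractions may be evaluated by either forward or backward
recurrence relations. Consider the finite continued fraction

\[ y = \frac{a_1}{b_1+}\;\frac{a_2}{b_2+}\;\frac{a_3}{b_3+}\;\cdots\;\frac{a_k}{b_k}. \]

The backward recurrence is $R_k = 1$, $R_{k-1} = b_k$,

\[ R_j = b_{j+1} R_{j+1} + a_{j+2} R_{j+2} \quad (j = k-2, \ldots, 0), \tag{4.43} \]

and $y = a_1 R_1/R_0$, with invariant

\[ \frac{R_j}{R_{j-1}} = \frac{1}{b_j+}\;\frac{a_{j+1}}{b_{j+1}+}\;\cdots\;\frac{a_k}{b_k}. \]

The forward recurrence is $P_0 = 0$, $P_1 = a_1$, $Q_0 = 1$, $Q_1 = b_1$,

\[ \left.\begin{aligned} P_j &= b_j P_{j-1} + a_j P_{j-2} \\ Q_j &= b_j Q_{j-1} + a_j Q_{j-2} \end{aligned}\right\} \quad (j = 2, \ldots, k), \tag{4.44} \]

and $y = P_k/Q_k$ (see Exercise 4.26).

The advantage of evaluating an infinite continued fraction such as (4.39)
via the forward recurrence is that the cutoff k need not be chosen in advance;
we can stop when $|D_k|$ is sufficiently small, where

\[ D_k = \frac{P_k}{Q_k} - \frac{P_{k-1}}{Q_{k-1}}. \tag{4.45} \]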

The main disadvantage of the forward recurrence is that twice as many arith- 
metic operations are required as for the backward recurrence with the same 
value of k. Another disadvantage is that the forward recurrence may be less 
numerically stable than the backward recurrence. 
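To make the two evaluation orders concrete, here is a short Python sketch (ours, in double precision). The function names and the helper `e1_coefficients`, which lists the partial numerators and denominators of Euler's continued fraction for $e^x E_1(x)$, are assumptions introduced for the example; the lists are 0-based, so `a[0]` holds $a_1$.

```python
def cf_backward(a, b):
    """Evaluate a_1/(b_1 + a_2/(b_2 + ... + a_k/b_k)) by the backward recurrence."""
    k = len(a)
    r_next2, r_next1 = 1, b[k - 1]            # R_k, R_{k-1}
    for j in range(k - 2, -1, -1):            # j = k-2, ..., 0
        # R_j = b_{j+1} R_{j+1} + a_{j+2} R_{j+2}  (lists are 0-based)
        r_j = b[j] * r_next1 + a[j + 1] * r_next2
        r_next2, r_next1 = r_next1, r_j
    # Here r_next1 = R_0 and r_next2 = R_1.
    return a[0] * r_next2 / r_next1


def cf_forward(a, b):
    """Evaluate the same finite continued fraction by the forward recurrence."""
    p_prev, p = 0, a[0]                       # P_0, P_1
    q_prev, q = 1, b[0]                       # Q_0, Q_1
    for j in range(1, len(a)):                # a[j], b[j] hold a_{j+1}, b_{j+1}
        p_prev, p = p, b[j] * p + a[j] * p_prev
        q_prev, q = q, b[j] * q + a[j] * q_prev
    return p / q


def e1_coefficients(x, k):
    """First k partial numerators/denominators of exp(x) E_1(x): 1/(x+ 1/(1+ 1/(x+ 2/(1+ ..."""
    a = [1] + [j // 2 for j in range(2, k + 1)]
    b = [x if j % 2 == 1 else 1 for j in range(1, k + 1)]
    return a, b


if __name__ == "__main__":
    a, b = e1_coefficients(1.0, 100)
    print(cf_backward(a, b), cf_forward(a, b))
```

Both functions return the same approximant; the backward version performs one multiply-add pair per step, the forward version two.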

If we are working with variable-precision floating-point arithmetic which
is much more expensive than single-precision floating-point, then a useful
strategy is to use the forward recurrence with single-precision arithmetic
(scaled to avoid overflow/underflow) to estimate k, then use the backward
recurrence with variable-precision arithmetic. One trick is needed: to evaluate
$D_k$ using scaled single-precision we use the recurrence

\[ D_1 = a_1/b_1, \qquad D_j = -a_j Q_{j-2} D_{j-1}/Q_j \quad (j = 2, 3, \ldots), \]

which avoids the cancellation inherent in (4.45).
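A possible way to organise this cutoff estimate is sketched below in Python (ours). The names `estimate_cutoff`, `e1_a` and `e1_b` are assumptions; Python floats stand in for the cheap single-precision arithmetic, and the rescaling step relies only on the fact that the recurrence for $Q_j$ is linear and homogeneous, so that only the ratio $Q_{j-2}/Q_j$ matters.

```python
def estimate_cutoff(a, b, tol, max_terms=100000):
    """Return (j, D_j) for the first j with |D_j| <= tol.

    a(j) and b(j) give the partial numerators and denominators (1-based).
    Uses D_j = -a_j Q_{j-2} D_{j-1} / Q_j with Q rescaled at every step.
    """
    q_prev, q = 1.0, float(b(1))              # Q_0, Q_1
    d = a(1) / b(1)                           # D_1
    j = 1
    while abs(d) > tol and j < max_terms:
        j += 1
        q_new = b(j) * q + a(j) * q_prev      # Q_j, up to the common scale factor
        d = -a(j) * q_prev * d / q_new        # D_j
        q_prev, q = q / q_new, 1.0            # rescale so that Q_j = 1
    return j, d


def e1_a(j):
    """Partial numerators of Euler's continued fraction for exp(x) E_1(x)."""
    return 1 if j == 1 else j // 2


def e1_b(j, x=1.0):
    """Partial denominators of the same continued fraction (x fixed to 1 by default)."""
    return x if j % 2 == 1 else 1.0


if __name__ == "__main__":
    print(estimate_cutoff(e1_a, e1_b, 1e-10))
```

The value of j returned here would then be used as the cutoff k for the backward recurrence in the expensive arithmetic.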

By analogy with the case of power series with decreasing terms that
alternate in sign, there is one case in which it is possible to give a simple
a posteriori bound for the error incurred in truncating a continued fraction.
Let f be a convergent continued fraction with approximants $f_k$ as in (4.41).
Then

Theorem 4.6.1 If $a_j > 0$ and $b_j > 0$ for all $j \in \mathbb{N}^*$, then the sequence
$(f_{2k})_{k \in \mathbb{N}}$ of even order approximants is strictly increasing, and the sequence
$(f_{2k+1})_{k \in \mathbb{N}}$ of odd order approximants is strictly decreasing. Thus

\[ f_{2k} < f < f_{2k+1} \]

and

\[ \left| f - \frac{f_{m-1} + f_m}{2} \right| < \left| \frac{f_m - f_{m-1}}{2} \right| \]

for all $m \in \mathbb{N}^*$.

In general, if the conditions of Theorem 4.6.1 are not satisfied, then it
is difficult to give simple, sharp error bounds. Power series and asymptotic
series are usually much easier to analyse than continued fractions.



4.7 Recurrence Relations 

The evaluation of special functions by continued fractions is a special case
of their evaluation by recurrence relations. To illustrate this, we consider
the Bessel functions of the first kind, $J_\nu(x)$. Here ν and x can in general be
complex, but we restrict attention to the case ν ∈ Z, x ∈ R. The functions
$J_\nu(x)$ can be defined in several ways, for example by the generating function
(elegant but only useful for ν ∈ Z):

\[ \exp\left(\frac{x}{2}\left(t - \frac{1}{t}\right)\right) = \sum_{\nu=-\infty}^{+\infty} t^\nu J_\nu(x), \]

or by the power series (also valid if ν ∉ Z):

\[ J_\nu(x) = \left(\frac{x}{2}\right)^{\nu} \sum_{j=0}^{\infty} \frac{(-1)^j (x^2/4)^j}{j!\,\Gamma(\nu + j + 1)}. \tag{4.48} \]

We also need Bessel functions of the second kind (sometimes called Neumann
functions or Weber functions) $Y_\nu(x)$, which may be defined by:

\[ Y_\nu(x) = \lim_{\mu \to \nu} \frac{J_\mu(x)\cos(\pi\mu) - J_{-\mu}(x)}{\sin(\pi\mu)}. \tag{4.49} \]

Both $J_\nu(x)$ and $Y_\nu(x)$ are solutions of Bessel's differential equation

\[ x^2 y'' + x y' + (x^2 - \nu^2) y = 0. \tag{4.50} \]



4.7.1 Evaluation of Bessel Functions 

The Bessel functions $J_\nu(x)$ satisfy the recurrence relation

\[ J_{\nu-1}(x) + J_{\nu+1}(x) = \frac{2\nu}{x} J_\nu(x). \tag{4.51} \]

Dividing both sides by $J_\nu(x)$, we see that

\[ \frac{J_{\nu-1}(x)}{J_\nu(x)} = \frac{2\nu}{x} - 1 \Big/ \frac{J_\nu(x)}{J_{\nu+1}(x)}, \]

which gives a continued fraction for the ratio $J_\nu(x)/J_{\nu-1}(x)$ (ν ≥ 1):

\[ \frac{J_\nu(x)}{J_{\nu-1}(x)} = \frac{1}{2\nu/x\,-}\;\frac{1}{2(\nu+1)/x\,-}\;\frac{1}{2(\nu+2)/x\,-}\;\cdots. \tag{4.52} \]

However, (4.52) is not immediately useful for evaluating the Bessel functions
$J_0(x)$ or $J_1(x)$, as it only gives their ratio.

The recurrence (4.51) may be evaluated backwards by Miller's algorithm.
The idea is to start at some sufficiently large index ν', take $f_{\nu'+1} = 0$, $f_{\nu'} = 1$,
and evaluate the recurrence

\[ f_{\nu-1} + f_{\nu+1} = \frac{2\nu}{x} f_\nu \tag{4.53} \]

backwards to obtain $f_{\nu'-1}, \ldots, f_0$. However, (4.53) is the same recurrence as
(4.51), so we expect to obtain $f_0 \approx c J_0(x)$ where c is some scale factor. We
can use the identity

\[ J_0(x) + 2 \sum_{\nu=1}^{\infty} J_{2\nu}(x) = 1 \tag{4.54} \]

to determine c.

To understand why Miller's algorithm works, and why evaluation of the
recurrence (4.51) in the forward direction is numerically unstable for ν > x,
we observe that the recurrence (4.53) has two independent solutions: the
desired solution $J_\nu(x)$, and an undesired solution $Y_\nu(x)$, where $Y_\nu(x)$ is a
Bessel function of the second kind, see Eqn. (4.49). The general solution of
the recurrence (4.53) is a linear combination of the special solutions $J_\nu(x)$
and $Y_\nu(x)$. Due to rounding errors, the computed solution will also be a linear
combination, say $a J_\nu(x) + b Y_\nu(x)$. Since $|Y_\nu(x)|$ increases exponentially with
ν when ν > ex/2, but $|J_\nu(x)|$ is bounded, the unwanted component will
increase exponentially if we use the recurrence in the forward direction, but
decrease if we use it in the backward direction.

More precisely, we have

\[ J_\nu(x) \sim \frac{1}{\sqrt{2\pi\nu}} \left(\frac{ex}{2\nu}\right)^{\nu}
   \quad\text{and}\quad
   Y_\nu(x) \sim -\sqrt{\frac{2}{\pi\nu}} \left(\frac{2\nu}{ex}\right)^{\nu} \tag{4.55} \]

as ν → +∞ with x fixed. Thus, when ν is large and greater than ex/2, $J_\nu(x)$
is small and $|Y_\nu(x)|$ is large.

Miller's algorithm seems to be the most effective method in the region
where the power series (4.48) suffers from catastrophic cancellation but
asymptotic expansions are not sufficiently accurate. For more on Miller's algorithm,
see §4.12.
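As a concrete illustration, a double-precision version of Miller's algorithm for integer orders might look as follows. This Python sketch is ours: the function name, the argument `start` (playing the role of the starting index ν') and its value in the example are assumptions, and no attempt is made to choose `start` automatically or to guard against overflow for large `start`.

```python
def bessel_j_miller(x, nu_max, start):
    """Approximate J_0(x), ..., J_nu_max(x) by Miller's backward recurrence."""
    f = [0.0] * (start + 2)
    f[start + 1] = 0.0                        # f_{nu'+1} = 0
    f[start] = 1.0                            # f_{nu'}   = 1
    for nu in range(start, 0, -1):
        # f_{nu-1} = (2 nu / x) f_nu - f_{nu+1}
        f[nu - 1] = (2.0 * nu / x) * f[nu] - f[nu + 1]
    # Scale factor c from J_0 + 2 (J_2 + J_4 + ...) = 1.
    c = f[0] + 2.0 * sum(f[2::2])
    return [value / c for value in f[:nu_max + 1]]


if __name__ == "__main__":
    print(bessel_j_miller(1.0, 2, 30))        # J_0(1), J_1(1), J_2(1)
```

The starting index must be well above both x and the largest order wanted; here `start = 30` is far more than needed for x = 1.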

4.7.2 Evaluation of Bernoulli and Tangent numbers 

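In §4.5, Equations (4.35) and (4.38), the Bernoulli numbers $B_{2k}$ or scaled
Bernoulli numbers $C_k = B_{2k}/(2k)!$ were required. These constants can be
defined by the generating functions

\[ \sum_{k=0}^{\infty} B_k \frac{x^k}{k!} = \frac{x}{e^x - 1} \tag{4.56} \]

and

\[ \sum_{k=0}^{\infty} C_k x^{2k} = \frac{x}{e^x - 1} + \frac{x}{2} = \frac{x/2}{\tanh(x/2)}. \tag{4.57} \]

Multiplying both sides of (4.56) or (4.57) by $e^x - 1$, and equating coefficients,
gives the recurrence relations

\[ B_0 = 1, \qquad \sum_{j=0}^{k} \binom{k+1}{j} B_j = 0 \quad \text{for } k > 0, \tag{4.58} \]

and

\[ C_0 = 1, \qquad \sum_{j=0}^{k} \frac{C_j}{(2k+1-2j)!} = \frac{1}{2\,(2k)!} \quad \text{for } k > 0. \tag{4.59} \]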



These recurrences, or slight variants with similar numerical properties, have 
often been used to evaluate Bernoulli numbers. 

In this chapter our philosophy is that the required precision is not known 
in advance, so it is not possible to precompute the Bernoulli numbers and 
store them in a table once and for all. Thus, we need a good algorithm for 
computing them at runtime. 

Unfortunately, forward evaluation of the recurrence (4.58), or the
corresponding recurrence (4.59) for the scaled Bernoulli numbers, is numerically
unstable: using precision n the relative error in the computed $B_{2k}$ or $C_k$ is
of order $4^k 2^{-n}$: see Exercise 4.35.

Despite its numerical instability, use of (4.59) may give the $C_k$ to
acceptable accuracy if they are only needed to generate coefficients in an
Euler-Maclaurin expansion whose successive terms diminish by at least a factor of
four (or if the $C_k$ are computed using exact rational arithmetic). If the $C_k$ are
required to precision n, then (4.59) should be used with sufficient guard
digits, or (better) a more stable recurrence should be used. If we multiply both
sides of (4.57) by sinh(x/2)/x and equate coefficients, we get the recurrence

\[ \sum_{j=0}^{k} \frac{C_j}{(2k+1-2j)!\,4^{k-j}} = \frac{1}{(2k)!\,4^k}. \tag{4.60} \]

If (4.60) is used to evaluate $C_k$, using precision n arithmetic, the relative
error is only $O(k^2 2^{-n})$. Thus, use of (4.60) gives a stable algorithm for
evaluating the scaled Bernoulli numbers $C_k$ (and hence, if desired, the Bernoulli
numbers).
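The recurrence obtained by multiplying the generating function of the $C_k$ by sinh(x/2)/x can be solved for $C_k$ directly, since the coefficient of $C_k$ in it is 1. The Python sketch below (ours) does this with exact rationals, so it illustrates the recurrence itself rather than its floating-point error behaviour; the function name is an assumption.

```python
from fractions import Fraction
from math import factorial


def scaled_bernoulli(kmax):
    """Return [C_0, ..., C_kmax] with C_k = B_{2k}/(2k)!, using exact rationals.

    Solves  sum_{j=0}^{k} C_j / ((2k+1-2j)! 4^(k-j)) = 1 / ((2k)! 4^k)  for C_k.
    """
    c = []
    for k in range(kmax + 1):
        s = Fraction(1, factorial(2 * k) * 4 ** k)
        for j in range(k):
            s -= c[j] / (factorial(2 * k + 1 - 2 * j) * 4 ** (k - j))
        c.append(s)                           # the j = k term has coefficient 1
    return c


if __name__ == "__main__":
    c = scaled_bernoulli(4)
    print([c[k] * factorial(2 * k) for k in range(5)])   # B_0, B_2, B_4, B_6, B_8
```

Replacing `Fraction` by a floating-point type of precision n would give the variant whose error behaviour is discussed above.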
An even better, and perfectly stable, way to compute Bernoulli numbers
is to exploit their relationship with the tangent numbers $T_j$, defined by

\[ \tan x = \sum_{j \ge 1} T_j \frac{x^{2j-1}}{(2j-1)!}. \tag{4.61} \]

The tangent numbers are positive integers and can be expressed in terms of
Bernoulli numbers:

\[ T_j = (-1)^{j-1}\, 2^{2j} \left(2^{2j} - 1\right) \frac{B_{2j}}{2j}. \tag{4.62} \]

Conversely, the Bernoulli numbers can be expressed in terms of tangent numbers:

\[ B_j = \begin{cases} 1 & \text{if } j = 0, \\ -1/2 & \text{if } j = 1, \\ (-1)^{j/2-1}\, j\, T_{j/2} / (4^j - 2^j) & \text{if } j > 0 \text{ is even}, \\ 0 & \text{otherwise.} \end{cases} \]

Eqn. (4.62) shows that the odd primes in the denominator of the Bernoulli
number $B_{2j}$ must be divisors of $2^{2j} - 1$. In fact, this is a consequence of
Fermat's little theorem and the Von Staudt-Clausen theorem, which says
that the primes p dividing the denominator of $B_{2j}$ are precisely those for
which (p − 1) | 2j (see §4.12).

We now derive a recurrence that can be used to compute tangent numbers,
using only integer arithmetic. For brevity write t = tan x and D = d/dx.
Then $Dt = \sec^2 x = 1 + t^2$. It follows that $D(t^n) = n t^{n-1}(1 + t^2)$ for all
$n \in \mathbb{N}^*$.

It is clear that $D^n t$ is a polynomial in t, say $P_n(t)$. For example, $P_0(t) = t$,
$P_1(t) = 1 + t^2$, etc. Write $P_n(t) = \sum_{j \ge 0} p_{n,j} t^j$. From the recurrence
$P_n(t) = D P_{n-1}(t)$, and the formula for $D(t^n)$ just noted, we see that
$\deg(P_n) = n + 1$ and

\[ \sum_{j \ge 0} p_{n,j} t^j = \sum_{j \ge 0} j\, p_{n-1,j}\, t^{j-1} (1 + t^2), \]

so

\[ p_{n,j} = (j - 1)\, p_{n-1,j-1} + (j + 1)\, p_{n-1,j+1} \tag{4.63} \]

for all $n \in \mathbb{N}^*$. Using (4.63) it is straightforward to compute the coefficients
of the polynomials $P_1(t)$, $P_2(t)$, etc.

Algorithm 4.3 TangentNumbers
Input: positive integer m
Output: Tangent numbers $T_1, \ldots, T_m$
    $T_1 \leftarrow 1$
    for k from 2 to m do
        $T_k \leftarrow (k - 1)\, T_{k-1}$
    for k from 2 to m do
        for j from k to m do
            $T_j \leftarrow (j - k)\, T_{j-1} + (j - k + 2)\, T_j$
    return $T_1, T_2, \ldots, T_m$.





Observe that, since tan x is an odd function of x, the polynomials $P_{2k}(t)$
are odd, and the polynomials $P_{2k+1}(t)$ are even. Equivalently, $p_{n,j} = 0$ if
n + j is even.

We are interested in the tangent numbers $T_k = P_{2k-1}(0) = p_{2k-1,0}$. Using
the recurrence (4.63) but avoiding computation of the coefficients that are
known to vanish, we obtain Algorithm TangentNumbers for the in-place
computation of tangent numbers. Note that this algorithm uses only
arithmetic on non-negative integers. If implemented with single-precision integers,
there may be problems with overflow as the tangent numbers grow rapidly. If
implemented using floating-point arithmetic, it is numerically stable because
there is no cancellation. An analogous algorithm SecantNumbers is the
topic of Exercise 4.40.
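Since Python integers have arbitrary precision, the in-place scheme can be tried directly without any overflow concern. The sketch below is ours: it initialises $T_k$ to (k − 1)! and then applies the double loop with update $T_j \leftarrow (j - k) T_{j-1} + (j - k + 2) T_j$. The helper `bernoulli_from_tangent`, which converts tangent numbers back to Bernoulli numbers via $B_j = (-1)^{j/2-1} j\, T_{j/2}/(4^j - 2^j)$ for even j > 0, and both function names are assumptions made for the example.

```python
from fractions import Fraction


def tangent_numbers(m):
    """Return [T_1, ..., T_m] using only arithmetic on non-negative integers."""
    t = [0] * (m + 1)                         # t[1..m]; index 0 is unused
    t[1] = 1
    for k in range(2, m + 1):
        t[k] = (k - 1) * t[k - 1]
    for k in range(2, m + 1):
        for j in range(k, m + 1):
            t[j] = (j - k) * t[j - 1] + (j - k + 2) * t[j]
    return t[1:]


def bernoulli_from_tangent(j, tangent):
    """B_j from a list tangent = [T_1, T_2, ...] (0-based list, so T_i = tangent[i-1])."""
    if j == 0:
        return Fraction(1)
    if j == 1:
        return Fraction(-1, 2)
    if j % 2 == 1:
        return Fraction(0)
    sign = -1 if (j // 2 - 1) % 2 else 1
    return Fraction(sign * j * tangent[j // 2 - 1], 4 ** j - 2 ** j)


if __name__ == "__main__":
    t = tangent_numbers(5)
    print(t)
    print([bernoulli_from_tangent(j, t) for j in range(0, 9, 2)])
```

The inner loops perform about m²/2 multiplications of a large integer by a small one, and no subtraction ever occurs.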

The tangent numbers grow rapidly because the generating function tan x
has poles at x = ±π/2. Thus, we expect $T_k$ to grow roughly like
$(2k - 1)!\,(2/\pi)^{2k}$. More precisely,

\[ \frac{T_k}{(2k-1)!} = \frac{2^{2k+1}\left(1 - 2^{-2k}\right)\zeta(2k)}{\pi^{2k}}, \tag{4.64} \]

where ζ(s) is the usual Riemann zeta-function, and

\[ (1 - 2^{-s})\,\zeta(s) = 1 + 3^{-s} + 5^{-s} + \cdots \]

is sometimes called the odd zeta-function.

The Bernoulli numbers also grow rapidly, but not quite as fast as the
tangent numbers, because the singularities of the generating function (4.56)
are further from the origin (at ±2iπ instead of ±π/2). It is well-known that
the Riemann zeta-function for even non-negative integer arguments can be
expressed in terms of Bernoulli numbers; the relation is

\[ (-1)^{k-1} \frac{B_{2k}}{(2k)!} = \frac{2\,\zeta(2k)}{(2\pi)^{2k}}. \tag{4.65} \]

Since $\zeta(2k) = 1 + O(4^{-k})$ as k → +∞, we see that

\[ |B_{2k}| \sim \frac{2\,(2k)!}{(2\pi)^{2k}}. \]

It is easy to see that (4.64) and (4.65) are equivalent, in view of the
relation (4.62).

An asymptotically fast way of computing Bernoulli numbers is the topic
of Exercise 4.41. For yet another way of computing Bernoulli numbers, using
very little space, see §4.10.



4.8 Arithmetic-Geometric Mean 

The (theoretically) fastest known methods for very large precision n use the
arithmetic-geometric mean (AGM) iteration of Gauss and Legendre. The
AGM is another nonlinear recurrence, important enough to treat separately.
Its complexity is O(M(n) ln n); the implicit constant here can be quite large,
so other methods are better for small n.

Given $(a_0, b_0)$, the AGM iteration is defined by

\[ (a_{j+1}, b_{j+1}) = \left( \frac{a_j + b_j}{2},\ \sqrt{a_j b_j} \right). \]

For simplicity we only consider real, positive starting values $(a_0, b_0)$ here (for
complex starting values, see §§4.8.5, 4.12). The AGM iteration converges
quadratically to a limit which we denote by AGM$(a_0, b_0)$.

The AGM is useful because:

1. it converges quadratically. Eventually the number of correct digits
doubles at each iteration, so only O(log n) iterations are required;

2. each iteration takes time O(M(n)) because the square root can be
computed in time O(M(n)) by Newton's method (see §3.5 and §4.2.3);

3. if we take suitable starting values $(a_0, b_0)$, the result AGM$(a_0, b_0)$ can be
used to compute logarithms (directly) and other elementary functions
(less directly), as well as constants such as π and ln 2.
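The quadratic convergence is easy to observe numerically. The following Python sketch (ours, in double precision) replaces (a, b) by their arithmetic and geometric means until the two agree to working accuracy, and also reports how many steps were taken; the function name, the relative tolerance and the iteration cap are assumptions.

```python
import math


def agm(a, b, rel_tol=1e-15, max_iter=64):
    """Return (AGM(a, b), number of iterations) for positive floats a and b."""
    steps = 0
    while abs(a - b) > rel_tol * abs(a) and steps < max_iter:
        a, b = (a + b) / 2, math.sqrt(a * b)
        steps += 1
    return (a + b) / 2, steps


if __name__ == "__main__":
    print(agm(1.0, math.sqrt(2.0)))
```

For the starting values (1, √2) only a handful of iterations are needed in double precision, in line with the O(log n) iteration count stated above.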

4.8.1 Elliptic Integrals 

The theory of the AGM iteration is intimately linked to the theory of elliptic
integrals. The complete elliptic integral of the first kind is defined by

\[ K(k) = \int_0^{\pi/2} \frac{d\theta}{\sqrt{1 - k^2 \sin^2\theta}} = \int_0^1 \frac{dt}{\sqrt{(1 - t^2)(1 - k^2 t^2)}}, \tag{4.67} \]

and the complete elliptic integral of the second kind is

\[ E(k) = \int_0^{\pi/2} \sqrt{1 - k^2 \sin^2\theta}\; d\theta = \int_0^1 \sqrt{\frac{1 - k^2 t^2}{1 - t^2}}\; dt, \]

where k ∈ [0, 1] is called the modulus and $k' = \sqrt{1 - k^2}$ is the complementary
modulus. It is traditional (though confusing as the prime does not denote
differentiation) to write K'(k) for K(k') and E'(k) for E(k').

The Connection With Elliptic Integrals. Gauss discovered that

\[ \frac{1}{\mathrm{AGM}(1, k)} = \frac{2}{\pi} K'(k). \tag{4.68} \]

This identity can be used to compute the elliptic integral K rapidly via
the AGM iteration. We can also use it to compute logarithms. From the
definition (4.67), we see that K(k) has a series expansion that converges for
|k| < 1 (in fact $K(k) = (\pi/2) F(1/2, 1/2; 1; k^2)$ is a hypergeometric function).
For small k we have

\[ K(k) = \frac{\pi}{2} \left( 1 + \frac{k^2}{4} + O(k^4) \right). \tag{4.69} \]

It can also be shown that

\[ K'(k) = \frac{2}{\pi} \ln\left(\frac{4}{k}\right) K(k) - \frac{k^2}{4} + O(k^4). \tag{4.70} \]

4.8.2 First AGM Algorithm for the Logarithm

From the formulae (4.68), (4.69) and (4.70), we easily get

\[ \frac{\pi/2}{\mathrm{AGM}(1, k)} = \ln\left(\frac{4}{k}\right) \left( 1 + O(k^2) \right). \tag{4.71} \]

Thus, if x = 4/k is large, we have

\[ \ln(x) = \frac{\pi/2}{\mathrm{AGM}(1, 4/x)} \left( 1 + O\left(\frac{1}{x^2}\right) \right). \]

If $x \ge 2^{n/2}$, we can compute ln(x) to precision n using the AGM iteration.
It takes about 2 lg(n) iterations to converge if $x \in [2^{n/2}, 2^n]$.
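In double precision (n = 53) the first AGM algorithm can be tried out as follows. This Python sketch is ours: the function names are assumptions, `math.pi` and `math.log(2.0)` stand in for precomputed constants π and ln 2, and small arguments are first multiplied by a power of two until they reach $2^{n/2}$, with the corresponding multiple of ln 2 subtracted afterwards.

```python
import math


def agm(a, b, rel_tol=1e-15, max_iter=64):
    """Arithmetic-geometric mean of two positive floats."""
    steps = 0
    while abs(a - b) > rel_tol * abs(a) and steps < max_iter:
        a, b = (a + b) / 2, math.sqrt(a * b)
        steps += 1
    return (a + b) / 2


def ln_agm(x, n=53):
    """ln(x) for x > 0 via (pi/2) / AGM(1, 4/s), where s = 2^l x >= 2^(n/2)."""
    l = 0
    s = x
    while s < 2.0 ** (n / 2):                 # scale x up by powers of two
        s *= 2.0
        l += 1
    return math.pi / (2.0 * agm(1.0, 4.0 / s)) - l * math.log(2.0)


if __name__ == "__main__":
    print(ln_agm(10.0), math.log(10.0))
```

The final subtraction loses a few bits when ln(x) is much smaller than l ln 2, which is one reason guard digits are needed in a careful implementation.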

Note that we need the constant π, which could be computed by using
our formula twice with slightly different arguments $x_1$ and $x_2$, then taking
differences to approximate (d ln(x)/dx)/π at $x_1$ (see Exercise 4.44). More
efficient is to use the Brent-Salamin (or Gauss-Legendre) algorithm, which
is based on the AGM and the Legendre relation

\[ E K' + E' K - K K' = \frac{\pi}{2}. \tag{4.72} \]

Argument Expansion. If x is not large enough, we can compute

\[ \ln(2^\ell x) = \ell \ln 2 + \ln x \]

by the AGM method (assuming the constant ln 2 is known). Alternatively,
if x > 1, we can square x enough times and compute

\[ \ln\left(x^{2^\ell}\right) = 2^\ell \ln(x). \]

This method with x = 2 gives a way of computing ln 2, assuming we already
know π.

The Error Term. The $O(k^2)$ error term in the formula (4.71) is a nuisance.
A rigorous bound is

\[ \left| \frac{\pi/2}{\mathrm{AGM}(1, k)} - \ln\left(\frac{4}{k}\right) \right| \le 4k^2 (8 - \ln k) \tag{4.73} \]

for all k ∈ (0, 1], and the bound can be sharpened to $0.37 k^2 (2.4 - \ln(k))$ if
k ∈ (0, 0.5].

The error $O(k^2 |\ln k|)$ makes it difficult to accelerate convergence by using
a larger value of k (i.e., a value of x = 4/k smaller than $2^{n/2}$). There is an
exact formula which is much more elegant and avoids this problem. Before
giving this formula we need to define some theta functions and show how
they can be used to parameterise the AGM iteration.

4.8.3 Theta Functions 

We need the theta functions $\theta_2(q)$, $\theta_3(q)$ and $\theta_4(q)$, defined for |q| < 1 by:

\[ \theta_2(q) = \sum_{n=-\infty}^{+\infty} q^{(n+1/2)^2} = 2 q^{1/4} \sum_{n=0}^{+\infty} q^{n(n+1)}, \tag{4.74} \]

\[ \theta_3(q) = \sum_{n=-\infty}^{+\infty} q^{n^2} = 1 + 2 \sum_{n=1}^{+\infty} q^{n^2}, \tag{4.75} \]

\[ \theta_4(q) = \theta_3(-q) = 1 + 2 \sum_{n=1}^{+\infty} (-1)^n q^{n^2}. \tag{4.76} \]

Note that the defining power series are sparse so it is easy to compute $\theta_2(q)$
and $\theta_3(q)$ for small q. Unfortunately, the rectangular splitting method of
§4.4.3 does not help to speed up the computation.

The asymptotically fastest methods to compute theta functions use the 
AGM. However, we do not follow this trail because it would lead us in circles! 
We want to use theta functions to give starting values for the AGM iteration. 



Theta Function Identities. There are many classical identities involving
theta functions. Two that are of interest to us are:

\[ \frac{\theta_3^2(q) + \theta_4^2(q)}{2} = \theta_3^2(q^2) \quad\text{and}\quad \theta_3(q)\,\theta_4(q) = \theta_4^2(q^2). \]

The latter may be written as

\[ \sqrt{\theta_3^2(q)\,\theta_4^2(q)} = \theta_4^2(q^2) \]

to show the connection with the AGM:

\[ \mathrm{AGM}(\theta_3^2(q), \theta_4^2(q)) = \mathrm{AGM}(\theta_3^2(q^2), \theta_4^2(q^2)) = \cdots = \mathrm{AGM}(\theta_3^2(q^{2^k}), \theta_4^2(q^{2^k})) = \cdots = 1 \]

for any |q| < 1. (The limit is 1 because $q^{2^k}$ converges to 0, thus both $\theta_3$ and
$\theta_4$ converge to 1.) Apart from scaling, the AGM iteration is parameterised
by $(\theta_3^2(q^{2^k}), \theta_4^2(q^{2^k}))$ for k = 0, 1, 2, ...

The Scaling Factor. Since AGM$(\theta_3^2(q), \theta_4^2(q)) = 1$, and AGM(λa, λb) =
λ · AGM(a, b), scaling gives AGM$(1, k') = 1/\theta_3^2(q)$ if $k' = \theta_4^2(q)/\theta_3^2(q)$.
Equivalently, since $\theta_2^4 + \theta_4^4 = \theta_3^4$ (Jacobi), $k = \theta_2^2(q)/\theta_3^2(q)$. However, we know
(from (4.68) with k → k') that 1/AGM(1, k') = 2K(k)/π, so

\[ K(k) = \frac{\pi}{2}\, \theta_3^2(q). \tag{4.77} \]

Thus, the theta functions are closely related to elliptic integrals. In the
literature q is usually called the nome associated with the modulus k.

From q to k and k to q. We saw that $k = \theta_2^2(q)/\theta_3^2(q)$, which gives k in
terms of q. There is also a nice inverse formula which gives q in terms of k:
q = exp(−πK'(k)/K(k)), or equivalently

\[ \ln\left(\frac{1}{q}\right) = \frac{\pi K'(k)}{K(k)}. \tag{4.78} \]

Sasaki and Kanada's Formula. Substituting (4.68) and (4.77) with
$k = \theta_2^2(q)/\theta_3^2(q)$ into (4.78) gives Sasaki and Kanada's elegant formula:

\[ \ln\left(\frac{1}{q}\right) = \frac{\pi}{\mathrm{AGM}(\theta_2^2(q), \theta_3^2(q))}. \tag{4.79} \]

This leads to the following algorithm to compute ln x.

4.8.4 Second AGM Algorithm for the Logarithm

Suppose x is large. Let q = 1/x, compute $\theta_2(q^4)$ and $\theta_3(q^4)$ from their
defining series (4.74) and (4.75), then compute AGM$(\theta_2^2(q^4), \theta_3^2(q^4))$. Sasaki
and Kanada's formula (with q replaced by $q^4$ to avoid the $q^{1/4}$ term in the
definition of $\theta_2(q)$) gives

\[ \ln(x) = \frac{\pi/4}{\mathrm{AGM}(\theta_2^2(q^4), \theta_3^2(q^4))}. \]

There is a trade-off between increasing x (by squaring or multiplication by
a power of 2, see the paragraph on "Argument Expansion" in §4.8.2), and
taking longer to compute $\theta_2(q^4)$ and $\theta_3(q^4)$ from their series. In practice it
seems good to increase x until q = 1/x is small enough that $O(q^{36})$ terms are
negligible. Then we can use

\[ \theta_2(q^4) = 2\left(q + q^9 + q^{25} + O(q^{49})\right), \]
\[ \theta_3(q^4) = 1 + 2\left(q^4 + q^{16} + O(q^{36})\right). \]

We need $x \ge 2^{n/36}$ which is much better than the requirement $x \ge 2^{n/2}$ for
the first AGM algorithm. We save about four AGM iterations at the cost of
a few multiplications.


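Implementation Notes. Since

\[ \mathrm{AGM}(\theta_2^2, \theta_3^2) = \frac{\mathrm{AGM}(2\theta_2\theta_3,\ \theta_2^2 + \theta_3^2)}{2}, \]

we can avoid the first square root in the AGM iteration. Also, it only takes
two nonscalar multiplications to compute $2\theta_2\theta_3$ and $\theta_2^2 + \theta_3^2$ from $\theta_2$ and $\theta_3$:
see Exercise 4.45. Another speedup is possible by trading the multiplications
for squares, see §4.12.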



Drawbacks of the AGM. The AGM has three drawbacks: 

1. the AGM iteration is not self-correcting, so we have to work with full
precision (plus any necessary guard digits) throughout. In contrast,
when using Newton's method or evaluating power series, many of the
computations can be performed with reduced precision, which saves a
log n factor (this amounts to using a negative number of guard digits);

2. the AGM with real arguments gives ln(x) directly. To obtain exp(x) we
need to apply Newton's method (§4.2.5 and Exercise 4.6). To evaluate
trigonometric functions such as sin(x), cos(x), arctan(x) we need to
work with complex arguments, which increases the constant hidden in
the "O" time bound. Alternatively, we can use Landen transformations
for incomplete elliptic integrals, but this gives even larger constants;

3. because it converges so fast, it is difficult to speed up the AGM. At
best we can save O(1) iterations (see however §4.12).

4.8.5 The Complex AGM 

In some cases the asymptotically fastest algorithms require the use of complex 
arithmetic to produce a real result. It would be nice to avoid this because 
complex arithmetic is significantly slower than real arithmetic. Examples 
where we seem to need complex arithmetic to get the asymptotically fastest 
algorithms are: 

1. arctan(x), arcsin(x), arccos(x) via the AGM, using, for example,

arctan(x) = ℑ(ln(1 + ix));

2. tan(x), sin(x), cos(x) using Newton's method and the above, or

cos(x) + i sin(x) = exp(ix),

where the complex exponential is computed by Newton's method from
the complex logarithm (see Eqn. (4.11)).

The theory that we outlined for the AGM iteration and AGM algorithms
for ln(z) can be extended without problems to complex z ∉ (−∞, 0], provided
we always choose the square root with positive real part.

A complex multiplication takes three real multiplications (using
Karatsuba's trick), and a complex squaring takes two real multiplications. We can
do even better in the FFT domain, assuming that one multiplication of cost
M(n) is equivalent to three Fourier transforms. In this model a squaring costs
2M(n)/3. A complex multiplication (a + ib)(c + id) = (ac − bd) + i(ad + bc)
requires four forward and two backward transforms, thus costs 2M(n). A
complex squaring $(a + ib)^2 = (a + b)(a - b) + i(2ab)$ requires two forward and
two backward transforms, thus costs 4M(n)/3. Taking this into account, we
get the asymptotic upper bounds relative to the cost of one multiplication
given in Table 4.1 (0.666 should be interpreted as ∼ 2M(n)/3, and so on).
See §4.12 for details of the algorithms giving these constants.

Operation         real          complex
squaring          0.666         1.333
multiplication    1.000         2.000
reciprocal        1.444         3.444
division          1.666         4.777
square root       1.333         5.333
AGM iteration     2.000         6.666
log via AGM       4.000 lg n    13.333 lg n

Table 4.1: Costs in the FFT domain
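Outside the FFT model, the counts of three real multiplications for a complex product and two for a complex square can be checked directly. The Python sketch below is ours; the function names are assumptions, and the doubling in the squaring routine is counted as a scalar operation (a shift), not as a multiplication.

```python
def complex_mul3(a, b, c, d):
    """(a + ib)(c + id) with three real multiplications (Karatsuba's trick)."""
    ac = a * c
    bd = b * d
    cross = (a + b) * (c + d)                 # equals ac + ad + bc + bd
    return ac - bd, cross - ac - bd           # (real part, imaginary part)


def complex_sqr2(a, b):
    """(a + ib)^2 = (a + b)(a - b) + i(2ab) with two real multiplications."""
    return (a + b) * (a - b), 2 * (a * b)


if __name__ == "__main__":
    print(complex_mul3(3, 4, 5, -2))          # (3 + 4i)(5 - 2i)
    print(complex_sqr2(3, 4))                 # (3 + 4i)^2
```

The saving of one multiplication is paid for by extra additions, which are cheap compared with M(n) when the operands are long.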



4.9 Binary Splitting 

Since the asymptotically fastest algorithms for arctan, sin, cos, etc. have a
large constant hidden in their time bound O(M(n) log n) (see "Drawbacks of
the AGM", §4.8.4), it is interesting to look for other algorithms
that may be competitive for a large range of precisions, even if not
asymptotically optimal. One such algorithm (or class of algorithms) is based on binary
splitting or the closely related FEE method (see §4.12). The time complexity
of these algorithms is usually

\[ O((\log n)^{\alpha} M(n)) \]

for some constant α ≥ 1 depending on how fast the relevant power series
converges, and also on the multiplication algorithm (classical, Karatsuba or
quasi-linear).



The Idea. Suppose we want to compute arctan(x) for rational x = p/q,
where p and q are small integers and |x| < 1/2. The Taylor series gives

\[ \arctan\left(\frac{p}{q}\right) \approx \sum_{0 \le j \le n/2} \frac{(-1)^j p^{2j+1}}{(2j+1)\, q^{2j+1}}. \]

The finite sum, if computed exactly, gives a rational approximation P/Q to
arctan(p/q), and

\[ \log|Q| = O(n \log n). \]

(Note: the series for exp converges faster, so in this case we sum ∼ n/ln n
terms and get log|Q| = O(n).)

The finite sum can be computed by the "divide and conquer" strategy:
sum the first half to get $P_1/Q_1$ say, and the second half to get $P_2/Q_2$, then

\[ \frac{P}{Q} = \frac{P_1}{Q_1} + \frac{P_2}{Q_2} = \frac{P_1 Q_2 + P_2 Q_1}{Q_1 Q_2}. \]

The rationals $P_1/Q_1$ and $P_2/Q_2$ are computed by a recursive application of
the same method, hence the term "binary splitting". If used with quadratic
multiplication, this way of computing P/Q does not help; however, fast
multiplication speeds up the balanced products $P_1 Q_2$, $P_2 Q_1$, and $Q_1 Q_2$.
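One way to organise the recursion so that the leaves involve only small integers is sketched below in Python (ours). It is a variant of the scheme just described rather than a transcription of it: each half returns, besides its partial sum P/Q, the power of r = −p²/q² needed to shift the other half, so the merge step multiplies the second half by that power before adding. The function names are assumptions, and no common factors are removed from P and Q.

```python
from fractions import Fraction


def _split(lo, hi, num, den):
    """Return (P, Q, A, B) with P/Q = sum_{j=lo}^{hi-1} r^(j-lo)/(2j+1)
    and A/B = r^(hi-lo), where r = num/den."""
    if hi - lo == 1:
        return 1, 2 * lo + 1, num, den
    mid = (lo + hi) // 2
    p1, q1, a1, b1 = _split(lo, mid, num, den)
    p2, q2, a2, b2 = _split(mid, hi, num, den)
    # P/Q = P1/Q1 + (A1/B1) (P2/Q2), combined over the common denominator.
    return p1 * b1 * q2 + a1 * p2 * q1, q1 * b1 * q2, a1 * a2, b1 * b2


def arctan_binary_splitting(p, q, terms):
    """Exact value of the first `terms` terms of the Taylor series of arctan(p/q)."""
    big_p, big_q, _, _ = _split(0, terms, -p * p, q * q)
    return Fraction(p * big_p, q * big_q)


if __name__ == "__main__":
    # Machin's formula: pi = 16 arctan(1/5) - 4 arctan(1/239).
    approx = 16 * arctan_binary_splitting(1, 5, 25) - 4 * arctan_binary_splitting(1, 239, 8)
    print(float(approx))
```

All products in the merge step involve operands of comparable size, which is what allows fast multiplication to pay off at the top levels of the recursion.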



Complexity. The overall time complexity is

\[ O\left( \sum_{k=1}^{\lceil \lg(n) \rceil} 2^k M(2^{-k} n \log n) \right) = O((\log n)^{\alpha} M(n)), \tag{4.80} \]

where α = 2 in the FFT range; in general α ≤ 2 (see Exercise 4.47).

We can save a little by working to precision n rather than n log n at the
top levels; but we still have α = 2 for quasi-linear multiplication.

In practice the multiplication algorithm would not be fixed but would 
depend on the size of the integers being multiplied. The complexity would 
depend on the algorithm(s) used at the top levels. 



Repeated Application of the Idea. If x ∈ (0, 0.25) and we want to
compute arctan(x), we can approximate x by a rational p/q and compute
arctan(p/q) as a first approximation to arctan(x), say p/q ≤ x < (p + 1)/q.
Now, from the subtraction formula for the tangent,

\[ \tan(\arctan(x) - \arctan(p/q)) = \frac{x - p/q}{1 + px/q}, \]

so

\[ \arctan(x) = \arctan(p/q) + \arctan(\delta), \]

where

\[ \delta = \frac{x - p/q}{1 + px/q} = \frac{qx - p}{q + px}. \]

We can apply the same idea to approximate arctan(δ), until eventually we get
a sufficiently accurate approximation to arctan(x). Note that |δ| < |x − p/q|
< 1/q, so it is easy to ensure that the process converges.

Complexity of Repeated Application. If we use a sequence of about
lg n rationals $p_1/q_1, p_2/q_2, \ldots$, where

\[ q_i = 2^{2^i}, \]

then the computation of each arctan$(p_i/q_i)$ takes time $O((\log n)^{\alpha} M(n))$, and
the overall time to compute arctan(x) is

\[ O((\log n)^{\alpha+1} M(n)). \]

Indeed, we have $0 \le p_i < 2^{2^{i-1}}$, thus $p_i$ has at most $2^{i-1}$ bits, and $p_i/q_i$ as a
rational has value $O(2^{-2^{i-1}})$ and size $O(2^i)$. The exponent α + 1 is 2 or 3.
Although this is not asymptotically as fast as AGM-based algorithms, the
implicit constants for binary splitting are small and the idea is useful for
quite large n (at least 10^6 decimal places).

Generalisations. The idea of binary splitting can be generalised. For
example, the Chudnovsky brothers gave a "bit-burst" algorithm which applies
to fast evaluation of solutions of linear differential equations. This is
described in §4.9.2.

4.9.1 A Binary Splitting Algorithm for sin, cos 

In [ini Theorem 6.2], Brent claims an $O(M(n) \log^2 n)$ algorithm for exp x and
sin x; however, the proof only covers the case of the exponential, and ends with
"the proof of (6.28) is similar". He had in mind deducing sin x from a complex
computation of exp(ix) = cos x + i sin x. Algorithm SinCos is a variation of
Brent's algorithm for exp x that computes sin x and cos x simultaneously, in
a way that avoids computations with complex numbers. The simultaneous
computation of sin x and cos x might be useful, for example, to compute tan x
or a plane rotation through the angle x.

Algorithm 4.4 SinCos
Input: floating-point 0 < x < 1/2, integer n
Output: an approximation of sin x and cos x with error $O(2^{-n})$
1: write $x \approx \sum_{i=0}^{k} p_i \cdot 2^{-2^{i+1}}$ where $0 \le p_i < 2^{2^i}$ and $k = \lceil \lg n \rceil - 1$
2: let $x_j = \sum_{i=j}^{k} p_i \cdot 2^{-2^{i+1}}$, with $x_{k+1} = 0$, and $y_j = p_j \cdot 2^{-2^{j+1}}$
3: $(S_{k+1}, C_{k+1}) \leftarrow (0, 1)$        ▷ $S_j$ is $\sin x_j$ and $C_j$ is $\cos x_j$
4: for j from k downto 0 do
5:     compute $\sin y_j$ and $\cos y_j$ using binary splitting
6:     $S_j \leftarrow \sin y_j \cdot C_{j+1} + \cos y_j \cdot S_{j+1}$, $C_j \leftarrow \cos y_j \cdot C_{j+1} - \sin y_j \cdot S_{j+1}$
7: return $(S_0, C_0)$.

At step 2 of Algorithm SinCos, we have $x_j = y_j + x_{j+1}$, thus
$\sin x_j = \sin y_j \cos x_{j+1} + \cos y_j \sin x_{j+1}$, and similarly for $\cos x_j$, explaining the formula
used at step 6. Step 5 uses a binary splitting algorithm similar to the one
described above for arctan(p/q): $y_j$ is a small rational, or is small itself, so
that all needed powers do not exceed n bits in size. This algorithm has the
same complexity $O(M(n) \log^2 n)$ as Brent's algorithm for exp x.

4.9.2 The Bit-Burst Algorithm 

The binary-splitting algorithms described above for arctan x, exp x, sin x
rely on a functional equation: tan(x + y) = (tan x + tan y)/(1 − tan x tan y),
exp(x + y) = exp(x) exp(y), sin(x + y) = sin x cos y + sin y cos x. We describe
here a more general algorithm, known as the "bit-burst" algorithm, which
does not require such a functional equation. This algorithm applies to a class
of functions known as holonomic functions. Other names are differentiably
finite and D-finite.

A function f(x) is said to be holonomic iff it satisfies a linear homogeneous
differential equation with polynomial coefficients in x. Equivalently,
the Taylor coefficients $u_k$ of f satisfy a linear homogeneous recurrence with
coefficients polynomial in k. The set of holonomic functions is closed under
the operations of addition and multiplication, but not necessarily under
division. For example, the exp, ln, sin, cos functions are holonomic, but tan is
not.

An important subclass of holonomic functions is the hypergeometric
functions, whose Taylor coefficients satisfy a recurrence $u_{k+1}/u_k = R(k)$, where
R(k) is a rational function of k (see §4.4). This matches the second
definition above, because we can write it as $u_{k+1} Q(k) - u_k P(k) = 0$ if R(k) =
P(k)/Q(k). Holonomic functions are much more general than hypergeometric
functions (see Exercise 4.48); in particular the ratio of two consecutive
terms in a hypergeometric series has size O(log k) (as a rational number),
but can be much larger for holonomic functions.

Theorem 4.9.1 If f is holonomic and has no singularities on a finite, closed
interval [A, B], where A < 0 < B and f(0) = 0, then f(x) can be computed
to an (absolute) accuracy of n bits, for any n-bit floating-point number
x ∈ (A, B), in time $O(M(n) \log^3 n)$.

Notes: For a sharper result, see Exercise 4.49. The condition f(0) = 0 is
just a technical condition to simplify the proof of the theorem; f(0) can be
any value that can be computed to n bits in time $O(M(n) \log^3 n)$.

Proof. Without loss of generality, we assume 0 ≤ x < 1 < B; the binary
expansion of x can then be written $x = 0.b_1 b_2 \ldots b_n$. Define $r_1 = 0.b_1$,
$r_2 = 0.0 b_2 b_3$, $r_3 = 0.000 b_4 b_5 b_6 b_7$ (the same decomposition was already used
in Algorithm SinCos): $r_1$ consists of the first bit of the binary expansion of
x, $r_2$ consists of the next two bits, $r_3$ the next four bits, and so on. We thus
have $x = r_1 + r_2 + \cdots + r_k$ where $2^{k-1} \le n < 2^k$.
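The decomposition of x into bursts of 1, 2, 4, ... bits is simple to carry out on the binary digits. The Python sketch below (ours) takes the digits $b_1 b_2 \ldots b_n$ as a string and returns the pieces $r_1, r_2, \ldots$ as exact rationals; the function name and the string representation of x are assumptions made for the example.

```python
from fractions import Fraction


def bit_burst_chunks(bits):
    """Split x = 0.b_1 b_2 ... b_n (bits given as a string of '0'/'1') into
    r_1, r_2, r_3, ... holding the next 1, 2, 4, ... bits respectively."""
    chunks = []
    pos, width = 0, 1
    while pos < len(bits):
        piece = bits[pos:pos + width]         # the last piece may be shorter
        end = pos + len(piece)
        # The piece occupies bit positions pos+1, ..., end after the binary point.
        chunks.append(Fraction(int(piece, 2), 2 ** end))
        pos, width = end, 2 * width
    return chunks


if __name__ == "__main__":
    r = bit_burst_chunks("1011011")
    print(r, sum(r))
```

Each piece has a numerator of at most twice as many bits as the previous one while being roughly the square of it in magnitude, which is what balances the cost of the successive steps.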

Define Xi = ri + ■ ■ ■ + rj with xq = 0. The idea of the algorithm is 
to translate the Taylor series of / from Xj to Xj+i; since / is holonomic, 
this reduces to translating the recurrence on the corresponding coefficients. 
The condition that / has no singularity in [0,x] C -B] ensures that the 
translated recurrence is well-defined. We define fo(t) = f(t), fi(t) = /o(ri + 
t), f2{t) = fM + t), f,{t) = fi.,{ri + t) for t<k. We have /,(t) = 
f{xi + t), and /fc(t) = f{x + t) since Xk = x. Thus we are looking for 

fm = fix). 

Let f_i^*(t) = f_i(t) - f_i(0) be the non-constant part of the Taylor expan-
sion of f_i. We have f_i^*(r_{i+1}) = f_i(r_{i+1}) - f_i(0) = f_{i+1}(0) - f_i(0)
because f_{i+1}(t) = f_i(r_{i+1} + t). Thus:

    f_0^*(r_1) + \cdots + f_{k-1}^*(r_k)
        = (f_1(0) - f_0(0)) + \cdots + (f_k(0) - f_{k-1}(0))
        = f_k(0) - f_0(0) = f(x) - f(0).

Since f(0) = 0, this gives:

    f(x) = \sum_{i=0}^{k-1} f_i^*(r_{i+1}).

To conclude the proof, we will show that each term f_i^*(r_{i+1}) can be eval-
uated to n bits in time O(M(n) log^2 n). The rational r_{i+1} has a numerator
of at most 2^i bits, and

    0 \le r_{i+1} < 2^{1 - 2^i}.

Thus, to evaluate f_i^*(r_{i+1}) to n bits, n/2^i + O(log n) terms of the Taylor
expansion of f_i^*(t) are enough. We now use the fact that f is holonomic.
Assume f satisfies the following homogeneous linear differential equation
with polynomial coefficients:

    c_m(t) f^{(m)}(t) + \cdots + c_1(t) f'(t) + c_0(t) f(t) = 0.

Substituting x_i + t for t, we obtain a differential equation for f_i:

    c_m(x_i + t) f_i^{(m)}(t) + \cdots + c_1(x_i + t) f_i'(t) + c_0(x_i + t) f_i(t) = 0.

From this equation we deduce (see §4.12) a linear recurrence for the Taylor
coefficients of f_i(t), of the same order as that for f(t). The coefficients in the
recurrence for f_i(t) have O(2^i) bits, since x_i = r_1 + ... + r_i has O(2^i) bits.
It follows that the ℓ-th Taylor coefficient of f_i(t) has size O(ℓ(2^i + log ℓ)).
The ℓ log ℓ term comes from the polynomials in ℓ in the recurrence. Since
ℓ ≤ n/2^i + O(log n), this is O(n log n).

However, we do not want to evaluate the ℓ-th Taylor coefficient u_ℓ of f_i(t),
but the series

    s_\ell = \sum_{j=1}^{\ell} u_j r_{i+1}^j \approx f_i^*(r_{i+1}).

Noting that u_ℓ = (s_ℓ - s_{ℓ-1})/r_{i+1}^ℓ, and substituting this value in the
recurrence for (u_ℓ), say of order d, we obtain a recurrence of order d + 1 for
(s_ℓ). Putting this latter recurrence in matrix form S_ℓ = M_ℓ S_{ℓ-1}, where S_ℓ
is the vector (s_ℓ, s_{ℓ-1}, ..., s_{ℓ-d}), we obtain

    S_\ell = M_\ell M_{\ell-1} \cdots M_{d+1} S_d,        (4.81)

where the matrix product M_ℓ M_{ℓ-1} ... M_{d+1} can be evaluated in time
O(M(n) log^2 n) using binary splitting. □

Footnote: If f satisfies a non-homogeneous differential equation, say
E(t, f(t), f'(t), ..., f^{(k)}(t)) = b(t), where b(t) is polynomial in t,
differentiating it yields F(t, f(t), f'(t), ..., f^{(k+1)}(t)) = b'(t), and
b'(t)E(·) - b(t)F(·) is homogeneous.

We illustrate Theorem 4.9.1 with the arc-tangent function, which satisfies
the differential equation f'(t)(1 + t^2) = 1. This equation evaluates at x_i + t
to f_i'(t)(1 + (x_i + t)^2) = 1, where f_i(t) = f(x_i + t). This gives the recurrence

    (1 + x_i^2) \ell u_\ell + 2 x_i (\ell - 1) u_{\ell-1} + (\ell - 2) u_{\ell-2} = 0

for the Taylor coefficients u_ℓ of f_i. This recurrence translates to

    (1 + x_i^2) \ell v_\ell + 2 x_i r_{i+1} (\ell - 1) v_{\ell-1} + r_{i+1}^2 (\ell - 2) v_{\ell-2} = 0

for v_ℓ = u_ℓ r_{i+1}^ℓ, and to

    (1 + x_i^2) \ell (s_\ell - s_{\ell-1}) + 2 x_i r_{i+1} (\ell - 1)(s_{\ell-1} - s_{\ell-2})
        + r_{i+1}^2 (\ell - 2)(s_{\ell-2} - s_{\ell-3}) = 0

for s_ℓ = Σ_{j=1}^{ℓ} v_j. This recurrence of order 3 can be written in matrix
form, and Eqn. (4.81) enables one to efficiently compute s_ℓ ≈ f_i(r_{i+1}) - f_i(0)
using multiplication of 3 × 3 matrices and fast integer multiplication.
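The following Python sketch follows the arc-tangent example step by step. It is an illustration under stated assumptions rather than the algorithm as it would be implemented for speed: it uses exact rationals from `fractions.Fraction` and sums the recurrence for v_ℓ in a plain loop, where the theorem relies on binary splitting of the matrix product (4.81). The function name `atan_bit_burst`, the 8 guard bits and the stopping rule based on powers of r are choices made for this example.

```python
from fractions import Fraction
import math

def atan_bit_burst(x, n):
    """Approximate atan(x) for a Fraction x in [0, 1) with n fractional bits."""
    digits = x * 2 ** n
    if digits.denominator != 1 or not 0 <= x < 1:
        raise ValueError("x must lie in [0, 1) and have at most n fractional bits")
    digits = int(digits)
    tol = Fraction(1, 2 ** (n + 8))   # truncation target per step (8 guard bits)
    total = Fraction(0)               # accumulates f(x_i) - f(0)
    xi = Fraction(0)                  # x_i, the expansion point reached so far
    pos, width = 0, 1
    while pos < n:
        hi = min(pos + width, n)
        # r_{i+1}: bits pos+1 .. hi of x, at their original binary weights
        chunk = (digits >> (n - hi)) & ((1 << (hi - pos)) - 1)
        r = Fraction(chunk, 2 ** hi)
        if r != 0:
            c = 1 + xi * xi
            v_prev2, v_prev1 = Fraction(0), r / c   # v_1 = u_1 r = r / (1 + x_i^2)
            s = v_prev1
            power, l = r, 1
            # |v_l| <= r^l / l, so stop once r^l is below the tolerance
            while power > tol:
                l += 1
                power *= r
                v = -(2 * xi * r * (l - 1) * v_prev1
                      + r * r * (l - 2) * v_prev2) / (c * l)
                s += v
                v_prev2, v_prev1 = v_prev1, v
            total += s                # s approximates f_i(r_{i+1}) - f_i(0)
            xi += r
        pos, width = hi, 2 * width
    return total

x = Fraction(12345678, 2 ** 24)
approx = atan_bit_burst(x, 24)
```

Because r_{i+1} shrinks doubly exponentially with i, the inner loop needs fewer and fewer terms at each step, which is the effect the proof exploits.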



4.10 Contour Integration 

In this section we assume that facilities for arbitrary-precision complex arith- 
metic are available. These can be built on top of an arbitrary-precision real 
arithmetic package (see Chapters [3] and [5]) . 

Let f(z) be holomorphic in the disc |z| < R, R > 1, and let the power
series for f be

    f(z) = \sum_{j=0}^{\infty} a_j z^j.        (4.82)

From Cauchy's theorem [122, Ch. 7] we have

    a_j = \frac{1}{2\pi i} \int_C \frac{f(z)}{z^{j+1}} \, dz,        (4.83)

where C is the unit circle. The contour integral in (4.83) may be approxi-
mated numerically by sums

    S_{j,k} = \frac{1}{k} \sum_{m=0}^{k-1} f(e^{2\pi i m/k}) e^{-2\pi i j m/k}.        (4.84)

Let C' be a circle with centre at the origin and radius ρ ∈ (1, R). From
Cauchy's theorem, assuming that j < k, we have (see Exercise 4.50):

    S_{j,k} - a_j = \frac{1}{2\pi i} \int_{C'} \frac{f(z)}{(z^k - 1) z^{j+1}} \, dz
                  = a_{j+k} + a_{j+2k} + \cdots,        (4.85)

so |S_{j,k} - a_j| = O((R - δ)^{-(j+k)}) as k → ∞, for any δ > 0. For example, let

    f(z) = \frac{z}{e^z - 1} + \frac{z}{2}        (4.86)

be the generating function for the scaled Bernoulli numbers as in (4.57), so
a_{2j} = C_j = B_{2j}/(2j)! and R = 2π (because of the poles at ±2πi). Then

    S_{2j,k} - \frac{B_{2j}}{(2j)!} = \frac{B_{2j+k}}{(2j+k)!} + \frac{B_{2j+2k}}{(2j+2k)!} + \cdots,        (4.87)

so we can evaluate B_{2j} with relative error O((2π)^{-k}) by evaluating f(z) at k
points on the unit circle.
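As a numerical illustration of (4.84) and (4.86), the sketch below evaluates the sum in ordinary double-precision complex arithmetic, so it only demonstrates the idea and not the arbitrary-precision setting assumed in this section. The function names `f` and `contour_sum` and the choice k = 16 are assumptions made for this example.

```python
import cmath

def f(z):
    """f(z) = z/(e^z - 1) + z/2, evaluated away from z = 0."""
    return z / (cmath.exp(z) - 1) + z / 2

def contour_sum(idx, k):
    """S_{idx,k}: average of f(w^m) * w^(-idx*m) over the k-th roots of unity."""
    total = 0
    for m in range(k):
        w = cmath.exp(2j * cmath.pi * m / k)   # point e^{2 pi i m/k} on the unit circle
        total += f(w) * w ** (-idx)
    return total / k

c1 = contour_sum(2, 16)   # approximates B_2/2! = 1/12
c2 = contour_sum(4, 16)   # approximates B_4/4! = -1/720
```

With only 16 points the aliasing terms a_{j+k}, a_{j+2k}, ... are already far below double-precision rounding error, since they decay like (2π)^{-(j+k)}.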

There is some cancellation when using (4.84) to evaluate S_{2j,k} because
the terms in the sum are of order unity but the result is of order (2π)^{-2j}.
Thus O(j) guard digits are needed. In the following we assume j = O(n).

If exp(-2πijm/k) is computed efficiently from exp(-2πi/k) in the obvi-
ous way, the time required to evaluate B_2, ..., B_{2j} to precision n is O(jnM(n)),
and the space required is O(n). We assume here that we need all Bernoulli
numbers up to index 2j, but we do not need to store all of them simultane-
ously. This is the case if we are using the Bernoulli numbers as coefficients
in a sum such as (4.38).

The recurrence relation method of §4.7.2 is faster but requires space
Θ(jn). Thus, the method of contour integration has advantages if space
is critical.

For comments on other forms of numerical quadrature, see §4.12.


4.11 Exercises 

Exercise 4.1 If A(x) = Σ_{j≥0} a_j x^j is a formal power series over ℝ with a_0 = 1,
show that ln(A(x)) can be computed with error O(x^n) in time O(M(n)), where
M(n) is the time required to multiply two polynomials of degree n - 1. Assume a
reasonable smoothness condition on the growth of M(n) as a function of n. [Hint:
(d/dx) ln(A(x)) = A'(x)/A(x).] Does a similar result hold for n-bit numbers if x
is replaced by 1/2?

Exercise 4.2 (Schönhage [198] and Schost) Assume one wants to compute
1/s(x) mod x^n, for s(x) a power series. Design an algorithm using an odd-even
scheme (§1.3.5), and estimate its complexity in the FFT range.

Exercise 4.3 Suppose that g and h are sufficiently smooth functions satisfying
g(h(x)) = x on some interval. Let y_j = h(x_j). Show that the iteration

    x_{j+1} = x_j + \sum_{m=1}^{k-1} (y - y_j)^m \frac{g^{(m)}(y_j)}{m!}

is a k-th order iteration that (under suitable conditions) will converge to x = g(y).
[Hint: generalise the argument leading to (4.16).]

Exercise 4.4 Design a Horner-like algorithm for evaluating a series Σ_{j=0}^{k} a_j x^j in
the forward direction, while deciding dynamically where to stop. For the stopping
criterion, assume that the |a_j| are monotonic decreasing and that |x| < 1/2. [Hint:
use y = 1/x.]

Exercise 4.5 Assume one wants n bits of exp x for x of order 2^j, with the repeated
use of the doubling formula (§4.3.1), and the naive method to evaluate power series.
What is the best reduced argument x/2^k in terms of n and j? [Consider both cases
j ≥ 0 and j < 0.]

Exercise 4.6 Assuming one can compute an n-bit approximation to ln x in time
T(n), where n ≪ M(n) = o(T(n)), show how to compute an n-bit approxima-
tion to exp x in time ∼ T(n). Assume that T(n) and M(n) satisfy reasonable
smoothness conditions.

Exercise 4.7 Care has to be taken to use enough guard digits when computing
exp(x) by argument reduction followed by the power series (4.21). If x is of order
unity and k steps of argument reduction are used to compute exp(x) via

    \exp(x) = \left( \exp(x/2^k) \right)^{2^k},

show that about k bits of precision will be lost (so it is necessary to use about k
guard bits).

Exercise 4.8 Show that the problem analysed in Exercise 4.7 can be avoided if
we work with the function

    expm1(x) = \exp(x) - 1 = \sum_{j=1}^{\infty} \frac{x^j}{j!},

which satisfies the doubling formula expm1(2x) = expm1(x)(2 + expm1(x)).

Exercise 4.9 For x > -1, prove the reduction formula

    log1p(x) = 2 \, log1p\left( \frac{x}{1 + \sqrt{1 + x}} \right),

where the function log1p(x) is defined by log1p(x) = ln(1 + x), as in §4.4.2. Explain
why it might be desirable to work with log1p instead of ln in order to avoid loss
of precision (in the argument reduction, rather than in the reconstruction as in
Exercise 4.7). Note however that argument reduction for log1p is more expensive
than that for expm1, because of the square root.
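The doubling formula for expm1 from Exercise 4.8 can be tried out numerically. The sketch below works in double precision only and is not a solution to the exercises; the function name `expm1_by_doubling` and the fixed number of series terms are assumptions made for this example.

```python
import math

def expm1_by_doubling(x, k, terms=20):
    """Compute exp(x) - 1 by reducing x to x/2^k, summing the power series
    of expm1 there, then applying the doubling formula k times."""
    y = x / 2 ** k
    e, term = 0.0, 1.0
    for j in range(1, terms + 1):
        term *= y / j          # term = y^j / j!
        e += term
    for _ in range(k):
        e = e * (2 + e)        # expm1(2t) = expm1(t) * (2 + expm1(t))
    return e

value = expm1_by_doubling(1.0, 8)
```

Working with the small quantity expm1(x/2^k) directly avoids forming 1 + (small quantity), which is where the bits are lost in the reconstruction analysed in Exercise 4.7.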

Exercise 4.10 Give a numerically stable way of computing sinh(x) using one 
evaluation of expm1(|x|) and a small number of additional operations (compare
Eqn. (ICTD V 

Exercise 4.11 (White) Show that exp(x) can be computed via sinh(x) using
the formula

    \exp(x) = \sinh(x) + \sqrt{1 + \sinh^2(x)}.

Since

    \sinh(x) = \frac{e^x - e^{-x}}{2} = \sum_{k \ge 0} \frac{x^{2k+1}}{(2k+1)!},

this saves computing about half the terms in the power series for exp(x) at the
expense of one square root. How would you modify this method to preserve numer-
ical stability for negative arguments x? Can this idea be used for other functions
than exp(x)?

Exercise 4.12 Count precisely the number of nonscalar products necessary for
the two variants of rectangular series splitting (§4.4.3).

Exercise 4.13 A drawback of rectangular series splitting as presented in §4.4.3
is that the coefficients (a_{kℓ+m} in the classical splitting, or a_{jm+ℓ} in the modular
splitting) involved in the scalar multiplications might become large. Indeed, they
are typically a product of factorials, and thus have size O(d log d). Assuming that
the ratios a_{j+1}/a_j are small rationals, propose an alternate way of evaluating P(x).

Exercise 4.14 Make explicit the cost of the slowly growing function c(d) (§4.4.3).







Exercise 4.15 Prove the remainder term (4.28) in the expansion (4.27) for E_1(x).
[Hint: prove the result by induction on k, using integration by parts in the for-
mula (4.28).]

Exercise 4.16 Show that we can avoid using Cauchy principal value integrals by
defining Ei(z) and E_1(z) in terms of the entire function

    Ein(z) = \int_0^z \frac{1 - \exp(-t)}{t} \, dt = \sum_{j=1}^{\infty} \frac{(-1)^{j-1} z^j}{j! \, j}.

Exercise 4.17 Let E_1(x) be defined by (4.25) for real x > 0. Using (4.27), show
that

    \frac{1}{x} - \frac{1}{x^2} < e^x E_1(x) < \frac{1}{x}.

Exercise 4.18 In this exercise the series are purely formal, so ignore any questions
of convergence. Applications are given in Exercises 4.19-4.20.

Suppose that (a_j)_{j∈ℕ} is a sequence with exponential generating function s(z) =
Σ_{j=0}^{∞} a_j z^j/j!. Suppose that A_n = Σ_{j=0}^{n} C(n, j) a_j, where C(n, j) is the
binomial coefficient, and let S(z) = Σ_{j=0}^{∞} A_j z^j/j! be the exponential generating
function of the sequence (A_n)_{n∈ℕ}. Show that

    S(z) = \exp(z) s(z).



Exercise 4.19 The power series for Ein(z) given in Exercise 4.16 suffers from
catastrophic cancellation when z is large and positive (like the series for exp(-z)).
Use Exercise 4.18 to show that this problem can be avoided by using the power
series (where H_n denotes the n-th harmonic number)

    e^z \, Ein(z) = \sum_{j=1}^{\infty} \frac{H_j z^j}{j!}.



Exercise 4.20 Show that Eqn. (4.23) for erf(x) follows from Eqn. (4.22). [Hint:
this is similar to Exercise 4.19.]



Exercise 4.21 Give an algorithm to evaluate Γ(x) for real x ≥ 1/2, with guar-
anteed relative error O(2^{-n}). Use the method sketched in §4.5 for ln Γ(x). What
can you say about the complexity of the algorithm?

Exercise 4.22 Extend your solution to Exercise 4.21 to give an algorithm to
evaluate 1/Γ(z) for z ∈ ℂ, with guaranteed relative error O(2^{-n}). Note: Γ(z) has
poles at zero and the negative integers (that is, for -z ∈ ℕ), but we overcome this
difficulty by computing the entire function 1/Γ(z). Warning: |Γ(z)| can be very
small if ℑ(z) is large. This follows from Stirling's asymptotic expansion. In the
particular case of z = iy on the imaginary axis we have

    2 \ln |\Gamma(iy)| = \ln\left( \frac{\pi}{y \sinh(\pi y)} \right) \approx -\pi |y|.

More generally,

    |\Gamma(x + iy)|^2 \approx 2\pi |y|^{2x-1} \exp(-\pi |y|)

for x, y ∈ ℝ and |y| large.

Exercise 4.23 The usual form (4.38) of Stirling's approximation for ln(Γ(z)) in-
volves a divergent series. It is possible to give a version of Stirling's approximation
where the series is convergent:

    \ln \Gamma(z) = \left( z - \frac{1}{2} \right) \ln z - z + \frac{\ln(2\pi)}{2}
                    + \sum_{k=1}^{\infty} \frac{c_k}{(z+1)(z+2) \cdots (z+k)},        (4.88)

where the constants c_k can be expressed in terms of Stirling numbers of the first
kind, s(n, k), defined by the generating function

    \sum_{k=0}^{n} s(n, k) x^k = x(x - 1) \cdots (x - n + 1).

In fact

    c_k = \frac{1}{2k} \sum_{j=1}^{k} \frac{j \, |s(k, j)|}{(j + 1)(j + 2)}.

The Stirling numbers s(n, k) can be computed easily from a three-term recurrence,
so this gives a feasible alternative to the usual form of Stirling's approximation
with coefficients related to Bernoulli numbers.

Show, experimentally and/or theoretically, that the convergent form of Stir-
ling's approximation is not an improvement over the usual form as used in Exer-
cise 4.21.

Exercise 4.24 Implement procedures to evaluate E_1(x) to high precision for real
positive x, using (a) the power series (4.26), (b) the asymptotic expansion (4.27)
(if sufficiently accurate), (c) the method of Exercise 4.19, and (d) the continued
fraction (4.39) using the backward and forward recurrences as suggested in §4.6.
Determine empirically the regions where each method is the fastest.

Exercise 4.25 Prove the backward recurrence (4.43).

Exercise 4.26 Prove the forward recurrence (4.44).
[Hint: let

    y_k(x) = \frac{a_1}{b_1 +} \; \cdots \; \frac{a_{k-1}}{b_{k-1} +} \; \frac{a_k}{b_k + x}.

Show, by induction on k ≥ 1, that

    y_k(x) = \frac{P_k + P_{k-1} x}{Q_k + Q_{k-1} x}. ]

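Exercise 4.27 For the forward recurrence (4.44), show that

    \begin{pmatrix} Q_k & Q_{k-1} \\ P_k & P_{k-1} \end{pmatrix}
      = \begin{pmatrix} b_1 & 1 \\ a_1 & 0 \end{pmatrix}
        \begin{pmatrix} b_2 & 1 \\ a_2 & 0 \end{pmatrix}
        \cdots
        \begin{pmatrix} b_k & 1 \\ a_k & 0 \end{pmatrix}

holds for k > 0 (and for k = 0 if we define P_{-1}, Q_{-1} appropriately).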

Remark. This gives a way to use parallelism when evaluating continued fractions. 
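The matrix form lends itself to a product tree, in which the two halves of the product are independent and could be computed in parallel. The sketch below is one way to organise this with exact rationals; the function names `mat_mul`, `cf_product` and `convergent`, and the use of 0-based Python lists for a_1, ..., a_k and b_1, ..., b_k, are assumptions made for this example.

```python
from fractions import Fraction

def mat_mul(x, y):
    """Product of two 2 x 2 matrices stored as nested tuples."""
    return ((x[0][0] * y[0][0] + x[0][1] * y[1][0],
             x[0][0] * y[0][1] + x[0][1] * y[1][1]),
            (x[1][0] * y[0][0] + x[1][1] * y[1][0],
             x[1][0] * y[0][1] + x[1][1] * y[1][1]))

def cf_product(a, b, lo, hi):
    """Product of the matrices ((b_j, 1), (a_j, 0)) for indices lo <= j < hi."""
    if hi - lo == 1:
        return ((b[lo], 1), (a[lo], 0))
    mid = (lo + hi) // 2
    # The two halves do not depend on each other.
    return mat_mul(cf_product(a, b, lo, mid), cf_product(a, b, mid, hi))

def convergent(a, b):
    """k-th convergent P_k/Q_k of a_1/(b_1 + a_2/(b_2 + ...))."""
    (q, _), (p, _) = cf_product(a, b, 0, len(a))
    return Fraction(p, q)

golden = convergent([1] * 10, [1] * 10)   # 1/(1 + 1/(1 + ...)), ten levels
small = convergent([1, 2], [3, 4])        # 1/(3 + 2/4)
```

The first row of the product holds Q_k and Q_{k-1}, the second row P_k and P_{k-1}, matching the layout in Exercise 4.27.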

Exercise 4.28 For the forward recurrence (4.44), show that

    \det \begin{pmatrix} Q_k & Q_{k-1} \\ P_k & P_{k-1} \end{pmatrix} = (-1)^k a_1 a_2 \cdots a_k.


Exercise 4.29 Prove the identity (4.46).

Exercise 4.30 Prove Theorem 4.6.1.

Exercise 4.31 Investigate using the continued fraction (4.40) for evaluating the
complementary error function erfc(x) or the error function erf(x) = 1 - erfc(x).
Is there a region where the continued fraction is preferable to any of the methods
used in Algorithm Erf of §4.6?



Exercise 4.32 Show that the continued fraction (4.41) can be evaluated in time
O(M(k) log k) if the a_j and b_j are bounded integers (or rational numbers with
bounded numerators and denominators). [Hint: use Exercise 4.27.]

Exercise 4.33 Instead of (4.54), a different normalisation condition

    J_0(x)^2 + 2 \sum_{\nu=1}^{\infty} J_\nu(x)^2 = 1        (4.89)

could be used in Miller's algorithm. Which of these normalisation conditions is
preferable?

Exercise 4.34 Consider the recurrence f_{ν-1} + f_{ν+1} = 2K f_ν, where K > 0 is a
fixed real constant. We can expect the solution to this recurrence to give some
insight into the behaviour of the recurrence (4.53) in the region ν ≈ Kx. Assume
for simplicity that K ≠ 1. Show that the general solution has the form

    f_\nu = A \lambda^\nu + B \mu^\nu,

where λ and μ are the roots of the quadratic equation x^2 - 2Kx + 1 = 0, and
A and B are constants determined by the initial conditions. Show that there are
two cases: if K < 1 then λ and μ are complex conjugates on the unit circle, so
|λ| = |μ| = 1; if K > 1 then there are two real roots satisfying λμ = 1.

Exercise 4.35 Prove (or give a plausibility argument for) the statements made
in §4.7 that: (a) if a recurrence based on (4.59) is used to evaluate the scaled
Bernoulli number C_k, using precision n arithmetic, then the relative error is of
order 4^k 2^{-n}; and (b) if a recurrence based on (4.60) is used, then the relative error
is O(k^2 2^{-n}).

Exercise 4.36 Starting from the definition (4.56), prove Eqn. (4.57). Deduce the
relation (4.62) connecting tangent numbers and Bernoulli numbers.

Exercise 4.37 (a) Show that the number of bits required to represent the tangent
number T_k exactly is ∼ 2k lg k as k → ∞. (b) Show that the same applies for the
exact representation of the Bernoulli number B_{2k} as a rational number.

Exercise 4.38 Explain how the correctness of Algorithm TangentNumbers
(§4.7.2) follows from the recurrence (4.63).



Algorithm 4.5 SecantNumbers

Input: positive integer m
Output: Secant numbers S_0, S_1, ..., S_m
S_0 ← 1
for k from 1 to m do
    S_k ← k S_{k-1}
for k from 1 to m do
    for j from k + 1 to m do
        S_j ← (j - k) S_{j-1} + (j - k + 1) S_j
return S_0, S_1, ..., S_m.

Exercise 4.39 Show that the complexity of computing the tangent numbers
T_1, ..., T_m by Algorithm TangentNumbers (§4.7.2) is O(m^3 log m). Assume that
the multiplications of tangent numbers T_j by small integers take time O(log T_j).
[Hint: use the result of Exercise 4.37.]

Exercise 4.40 Verify that Algorithm SecantNumbers computes in-place the
Secant numbers S_k, defined by the generating function

    \sum_{k \ge 0} S_k \frac{x^{2k}}{(2k)!} = \frac{1}{\cos x},

in much the same way that Algorithm TangentNumbers (§4.7.2) computes the
Tangent numbers.
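For experimentation, Algorithm SecantNumbers can be transcribed into Python almost line for line. This is a direct transcription using Python's built-in arbitrary-precision integers, and it assumes the initialisation S_0 = 1; the function name `secant_numbers` is chosen for this example.

```python
def secant_numbers(m):
    """Return [S_0, S_1, ..., S_m], computed in place as in Algorithm SecantNumbers."""
    s = [1] * (m + 1)                # S_0 = 1; the other entries are overwritten
    for k in range(1, m + 1):
        s[k] = k * s[k - 1]          # S_k <- k * S_{k-1}, so S_k = k! at this point
    for k in range(1, m + 1):
        for j in range(k + 1, m + 1):
            s[j] = (j - k) * s[j - 1] + (j - k + 1) * s[j]
    return s

first = secant_numbers(4)
```

For m = 4 this yields 1, 1, 5, 61, 1385, the first coefficients of 1/cos x scaled by (2k)!.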

Exercise 4.41 (Harvey) The generating function (4.56) for Bernoulli numbers
can be written as

    \sum_{k \ge 0} B_k \frac{x^k}{k!} = 1 \Big/ \sum_{k \ge 0} \frac{x^k}{(k+1)!},

and we can use an asymptotically fast algorithm to compute the first n + 1 terms in
the reciprocal of the power series. This should be asymptotically faster than using
the recurrences given in §4.7.2. Give an algorithm using this idea to compute
the Bernoulli numbers B_0, B_1, ..., B_n in time O(n^2 (log n)^{2+ε}). Implement your
algorithm and see how large n needs to be for it to be faster than the algorithms
discussed in §4.7.2.



Algorithm 4.6 SeriesExponential

Input: positive integer m and real numbers a_1, a_2, ..., a_m
Output: real numbers b_0, b_1, ..., b_m such that
    b_0 + b_1 x + ... + b_m x^m = exp(a_1 x + ... + a_m x^m) + O(x^{m+1})
b_0 ← 1
for k from 1 to m do
    b_k ← (Σ_{j=1}^{k} j a_j b_{k-j}) / k
return b_0, b_1, ..., b_m.
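A short Python sketch of this recurrence follows. It rests on the identity B'(x) = A'(x)B(x) for B(x) = exp(A(x)), which gives k b_k = Σ_{j=1}^{k} j a_j b_{k-j} on comparing coefficients. Exact rationals are used so that the output can be checked against known series; the function name `series_exponential` and the list layout (a[0] holding a_1) are assumptions made for this example.

```python
from fractions import Fraction

def series_exponential(a):
    """Given a = [a_1, ..., a_m], return [b_0, ..., b_m] with
    b_0 + b_1 x + ... + b_m x^m = exp(a_1 x + ... + a_m x^m) + O(x^(m+1))."""
    m = len(a)
    b = [Fraction(1)] + [Fraction(0)] * m    # b_0 = 1
    for k in range(1, m + 1):
        # k * b_k = sum_{j=1..k} j * a_j * b_{k-j}
        b[k] = sum(j * a[j - 1] * b[k - j] for j in range(1, k + 1)) / k
    return b

exp_x = series_exponential([1, 0, 0, 0])     # A(x) = x
exp_x2 = series_exponential([0, 1, 0, 0])    # A(x) = x^2
```

The loop uses O(m^2) coefficient operations, which is adequate for small m.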



Exercise 4.42 (a) Show that Algorithm SeriesExponential computes B(x) =
exp(A(x)) up to terms of order x^{m+1}, where A(x) = a_1 x + a_2 x^2 + ... + a_m x^m
is input data and B(x) = b_0 + b_1 x + ... + b_m x^m is the output. [Hint: compare
Exercise 4.1.]

(b) Apply this to give an algorithm to compute the coefficients b_k in Stirling's
approximation for n! (or Γ(n + 1)):

    n! \sim \left( \frac{n}{e} \right)^n \sqrt{2\pi n} \, \sum_{k \ge 0} \frac{b_k}{n^k}.

[Hint: we know the coefficients in Stirling's approximation (4.38) for ln Γ(z) in
terms of Bernoulli numbers.]

(c) Is this likely to be useful for high-precision computation of Γ(x) for real
positive x?

Exercise 4.43 Deduce from Eqn. (4.69) and (4.70) an expansion of ln(4/k) with
error term O(k^4 log(4/k)). Use any means to figure out an effective bound on the
O() term. Deduce an algorithm requiring x ≥ 2^{n/3} only to get n bits of ln x.

Exercise 4.44 Show how both π and ln 2 can be evaluated using Eqn. (4.71).

Exercise 4.45 In §4.8.4 we mentioned that 2θ_2θ_3 and θ_2^2 + θ_3^2 can be computed
using two nonscalar multiplications. For example, we could (A) compute u =
(θ_2 + θ_3)^2 and v = θ_2θ_3; then the desired values are 2v and u - 2v. Alternatively,
we could (B) compute u and w = (θ_2 - θ_3)^2; then the desired values are (u ± w)/2.
Which method (A) or (B) is preferable?

Exercise 4.46 Improve the constants in Table 4.1.

Exercise 4.47 Justify Eqn. (4.80) and give an upper bound on the constant α if
the multiplication algorithm satisfies M(n) = Θ(n^c) for some c ∈ (1, 2].

Exercise 4.48 (Salvy) Is the function exp(x^2) + x/(1 - x^2) holonomic?

Exercise 4.49 (van der Hoeven, Mezzarobba) Improve to O(M(n) log^2 n)
the complexity given in Theorem 4.9.1.

Exercise 4.50 If w = e^{2πi/k}, show that

    \frac{1}{k} \sum_{m=0}^{k-1} \frac{w^m}{z - w^m} = \frac{1}{z^k - 1}.

194 Modern Computer Arithmetic, version 0.5.1 of April 28, 2010 



Deduce that S_{j,k}, defined by Eqn. (4.84), satisfies

    S_{j,k} = \frac{1}{2\pi i} \int_{C'} \frac{z^{k-j-1}}{z^k - 1} f(z) \, dz

for j < k, where the contour C' is as in §4.10. Deduce Eqn. (4.85).

Remark. Eqn. (4.85) illustrates the phenomenon of aliasing: observations at k
points cannot distinguish between the Fourier coefficients a_j, a_{j+k}, a_{j+2k}, etc.

Exercise 4.51 Show that the sum S_{2j,k} of §4.10 can be computed with (essen-
tially) only about k/4 evaluations of f if k is even. Similarly, show that about k/2
evaluations of f suffice if k is odd. On the other hand, show that the error bound
O((2π)^{-k}) following Eqn. (4.87) can be improved if k is odd.

4.12 Notes and References 

One of the main references for special functions is the "Handbook of Mathematical 
Functions" by Abramowitz and Stegun [1] , which gives many useful results but no 
proofs. A more recent book is that of Nico Temme [215], and a comprehensive
reference is Andrews et al. [1]. A large part of the content of this chapter comes
from [IH], and was implemented in the MP package [47]. In the context of floating-
point computations, the "Handbook of Floating-Point Arithmetic" [57] is a useful 
reference, especially Chapter 11. 

The SRT algorithm for division is named after Sweeney, Robertson [190] and
Tocher [217]. Original papers on Booth recoding, SRT division, etc., are reprinted
in the book by Swartzlander [213]. SRT division is similar to non-restoring division,
but uses a lookup table based on the dividend and the divisor to determine each
quotient digit. The Intel Pentium fdiv bug was caused by an incorrectly initialised
lookup table.

Basic material on Newton's method may be found in many references, for
example the books by Brent [41, Ch. 3], Householder [126] or Traub [219]. Some
details on the use of Newton's method in modern processors can be found in [128].
The idea of first computing y^{-1/2}, then multiplying by y to get y^{1/2} (§4.2.3) was
pushed further by Karp and Markstein [138], who perform this at the penultimate
iteration, and modify the last iteration of Newton's method for y^{-1/2} to directly
get y^{1/2} (see §1.4.5 for an example of the Karp-Markstein trick for division). For
more on Newton's method for power series, we refer to [A3\ [521 [56l 1143^ I15H 1203] .

Some good references on error analysis of floating-point algorithms are the
books by Higham [121] and Muller [175]. Older references include Wilkinson's
classics [2291123!]] .

Regarding doubling versus tripling: in §4.3.4 we assumed that one multiplica-
tion and one squaring were required to apply the tripling formula (4.19). However,
one might use the form sinh(3x) = 3 sinh(x) + 4 sinh^3(x), which requires only one
cubing. Assuming a cubing costs 50% more than a squaring (in the FFT range),
the ratio would be 1.5 log_3 2 ≈ 0.946. Thus, if a specialised cubing routine is
available, tripling may sometimes be slightly faster than doubling.

For an example of a detailed error analysis of an unrestricted algorithm, see [69] . 

The idea of rectangular series splitting to evaluate a power series with O(√n)
nonscalar multiplications (§4.4.3) was first published in 1973 by Paterson and
Stockmeyer [183]. It was rediscovered in the context of multiple-precision evalua-
tion of elementary functions by Smith [205, §8.7] in 1991. Smith gave it the name
"concurrent series". Smith proposed modular splitting of the series, but classical
splitting seems slightly better. Smith noticed that the simultaneous use of this
fast technique and argument reduction yields O(n^{1/3} M(n)) algorithms. Earlier, in
1960, Estrin [92] had found a similar technique with n/2 nonscalar multiplications,
but O(log n) parallel complexity.

There are several variants of the Euler-Maclaurin sum formula, with and with-
out bounds on the remainder. See for example Abramowitz and Stegun [1, Ch. 23],
and Apostol [6].

Most of the asymptotic expansions that we have given in §4.5 may be found
in Abramowitz and Stegun [1]. For more background on asymptotic expansions
of special functions, see for example the books by de Bruijn [83], Olver [181] and
Wong [232]. We have omitted mention of many other useful asymptotic expansions,
for example all but a few of those for Bessel functions [226, 228].

Most of the continued fractions mentioned in §4.6 may be found in Abram-
owitz and Stegun [1]. The classical theory is given in the books by Khinchin [140]
and Wall [225]. Continued fractions are used in the manner described in §4.6 in
arbitrary-precision packages such as MP [47]. A good recent reference on various
aspects of continued fractions for the evaluation of special functions is the Handbook
of Continued Fractions for Special Functions [83]. In particular, Chapter 7 of
this book contains a discussion of error bounds. Our Theorem 4.6.1 is a trivial
modification of [83, Theorem 7.5.1]. The asymptotically fast algorithm suggested
in Exercise 4.32 was given by Schönhage [196].

A proof of a generalisation of (4.54) is given in [U §4.9]. Miller's algorithm is
due to J. C. P. Miller. It is described, for example, in [H §9.12, §19.28] and [68,
§13.14]. An algorithm is given in [102].

A recurrence based on (4.60) was used to evaluate the scaled Bernoulli num-
bers C_k in the MP package following a suggestion of Christian Reinsch [48, §12].
Previously, the inferior recurrence (4.59) was widely used, for example in [141] and
in early versions of the MP package [47, §6.11]. The idea of using tangent numbers
is mentioned in [107, §6.5], where it is attributed to B. F. Logan. Our in-place
Algorithms TangentNumbers and SecantNumbers may be new (see Exer-
cises 4.38-4.40). Kaneko [135] describes an algorithm of Akiyama and Tanigawa
for computing Bernoulli numbers in a manner similar to "Pascal's triangle". How-
ever, it requires more arithmetic operations than Algorithm TangentNumbers.
Also, the Akiyama-Tanigawa algorithm is only recommended for exact rational
arithmetic, since it is numerically unstable if implemented in floating-point arith-
metic. For more on Bernoulli, Tangent and Secant numbers, and a connection with
Stirling numbers, see Chen [61] and Sloane [204, A027641, A000182, A000364].
The Von Staudt-Clausen theorem was proved independently by Karl von Staudt 
and Thomas Clausen in 1840. It can be found in many references. If just a single 
Bernoulli number of large index is required, then Harvey's modular algorithm [117]
can be recommended.

Some references on the Arithmetic-Geometric Mean (AGM) are Brent [43, 46,
51], Salamin [193], the Borweins' book [36], Arndt and Haenel [7]. An early ref-
erence, which includes some results that were rediscovered later, is the fascinating
report HAKMEM [15]. Bernstein [19] gives a survey of different AGM algorithms
for computing the logarithm. Eqn. (4.70) is given in Borwein & Borwein [36,
(1.3.10)], and the bound (4.73) is given in [Ml P- H, Exercise 4(c)]. The AGM
can be extended to complex starting values provided we take the correct branch
of the square root (the one with positive real part): see Borwein & Borwein [36,
pp. 15-16]. The use of the complex AGM is discussed in [88]. For theta function
identities, see [Ml Chapter 2], and for a proof of (4.78), see [Ml §2.3].

The use of the exact formula (4.79) to compute ln x was first suggested by
Sasaki and Kanada (see [Ml (7.2.5)], but beware the typo). See [l6] for Landen
transformations, and [l3] for more efficient methods; note that the constants given
in those papers might be improved using faster square root algorithms (Chapter [3]).

The constants in Table 4.1 are justified as follows. We assume we are in
the FFT domain, and one Fourier transform costs M(n)/3. The 13M(n)/9 ≈
1.444M(n) cost for a real reciprocal is from Harvey [116], and assumes M(n) ∼
3T(2n), where T(n) is the time to perform a Fourier transform of size n. For
the complex reciprocal 1/(v + iw) = (v - iw)/(v^2 + w^2), we compute v^2 + w^2
using two forward transforms and one backward transform, equivalent in cost to
M(n), then one real reciprocal to obtain say x = 1/(v^2 + w^2), then two real
multiplications to compute vx, wx, but take advantage of the fact that we already
know the forward transforms of v and w, and the transform of x only needs to
be computed once, so these two multiplications cost only M(n). Thus the total
cost is 31M(n)/9 ≈ 3.444M(n). The 1.666M(n) cost for real division is from [125,
Remark 6], and assumes M(n) ∼ 3T(2n) as above for the real reciprocal. For
complex division, say (t + iu)/(v + iw), we first compute the complex reciprocal
x + iy = 1/(v + iw), then perform a complex multiplication (t + iu)(x + iy),
but save the cost of two transforms by observing that the transforms of x and y
are known as a byproduct of the complex reciprocal algorithm. Thus the total
cost is (31/9 + 4/3)M(n) ≈ 4.777M(n). The 4M(n)/3 cost for the real square
root is from Harvey [116], and assumes M(n) ∼ 3T(2n) as above. The complex
square root uses Friedland's algorithm [97]: √(x + iy) = w + iy/(2w) where w =
√((|x| + (x^2 + y^2)^{1/2})/2); as for the complex reciprocal, x^2 + y^2 costs M(n), then
we compute its square root in 4M(n)/3, the second square root in 4M(n)/3, and
the division y/w costs 1.666M(n), which gives a total of 5.333M(n).

The cost of one real AGM iteration is at most the sum of the multiplication
cost and of the square root cost, but since we typically perform several iterations
it is reasonable to assume that the input and output of the iteration includes the
transforms of the operands. The transform of a + b is obtained by linearity from the
transforms of a and b, so is essentially free. Thus we save one transform or M(n)/3
per iteration, giving a cost per iteration of 2M(n). (Another way to save M(n)/3
is to trade the multiplication for a squaring, as explained in [199, §8.2.5].) The
complex AGM is analogous: it costs the same as a complex multiplication (2M(n))
and a complex square root (5.333M(n)), but we can save two (real) transforms
per iteration (2M(n)/3), giving a net cost of 6.666M(n). Finally, the logarithm
via the AGM costs 2 lg(n) + O(1) AGM iterations.

We note that some of the constants in Table 4.1 may not be optimal. For
example, it may be possible to reduce the cost of reciprocal or square root (Harvey,
Sergeev). We leave this as a challenge to the reader (see Exercise 4.46). Note that
the constants for operations on power series may differ from the corresponding
constants for operations on integers/reals.

There is some disagreement in the literature about "binary splitting" and the
"FEE method" of E. A. Karatsuba [137]. We choose the name "binary splitting"
because it is more descriptive, and let the reader call it the "FEE method" if he/she
prefers. Whatever its name, the idea is quite old, since in 1976 Brent [Ml Theorem
6.2] gave a binary splitting algorithm to compute exp x in time O(M(n)(log n)^2).
The CLN library implements several functions with binary splitting [108], and is
thus quite efficient for precisions of a million bits or more.

The "bit-burst" algorithm was invented by David and Gregory Chudnovsky [65],
and our Theorem 4.9.1 is based on their work. Some references on holonomic func-
tions are J. Bernstein [25, 26], van der Hoeven [123] and Zeilberger [234]. See also
the Maple gfun package [194], which allows one, amongst other things, to deduce
the recurrence for the Taylor coefficients of f(x) from its differential equation.

Footnote: It is quite common for the same idea to be discovered independently
several times. For example, Gauss and Legendre independently discovered the
connection between the arithmetic-geometric mean and elliptic integrals; Brent and
Salamin independently discovered an application of this to the computation of π,
and related algorithms were known to the authors of [TSl.

There are several topics that are not covered in this chapter, but might have 
been if we had more time and space. We mention some references here. A useful 
resource is the website [144].

The Riemann zeta-function ζ(s) can be evaluated by the Euler-Maclaurin ex-
pansion (4.34)-(4.36), or by Borwein's algorithm [38, 39], but neither of these
methods is efficient if ℑ(s) is large. On the critical line ℜ(s) = 1/2, the Riemann-
Siegel formula [99] is much faster and in practice sufficiently accurate, although
only an asymptotic expansion. If enough terms are taken the error seems to be
O(exp(-πt)) where t = ℑ(s): see Brent's review [82] and Berry's paper [28]. An
error analysis is given in [185]. The Riemann-Siegel coefficients may be defined by
a recurrence in terms of certain integers ρ_n that can be defined using Euler numbers
(see Sloane's sequence A087617 [204]). Sloane calls this the Gabcke sequence but
Gabcke credits Lehmer [156] so perhaps it should be called the Lehmer-Gabcke
sequence. The sequence (ρ_n) occurs naturally in the asymptotic expansion of
ln(Γ(1/4 + it/2)). The non-obvious fact that the ρ_n are integers was proved by de
Reyna [85].

Borwein's algorithm for $\zeta(s)$ can be generalised to cover functions such as the
polylogarithm and the Hurwitz zeta-function: see Vepstas [224].

To evaluate the Riemann zeta-function $\zeta(\sigma + it)$ for fixed $\sigma$ and many
equally-spaced points $t$, the fastest known algorithm is due to Andrew Odlyzko and Arnold
Schönhage [180]. It has been used by Odlyzko to compute blocks of zeros with
very large height $t$, see [178, 179]; also (with improvements) by Xavier Gourdon
to verify the Riemann Hypothesis for the first $10^{13}$ nontrivial zeros in the upper
half-plane, see [105]. The Odlyzko-Schönhage algorithm can be generalised for the
computation of other L-functions.

In §4.10 we briefly discussed the numerical approximation of contour integrals,
but we omitted any discussion of other forms of numerical quadrature, for
example Romberg quadrature, the tanh rule, the tanh-sinh rule, etc. Some references
are [m |I2l [H [95] (Ml [211], and [37, §7.4.3]. For further discussion of the
contour integration method, see [157]. For Romberg quadrature (which depends on
Richardson extrapolation), see [58, 189, 192]. For Clenshaw-Curtis and Gaussian
quadrature, see [67, 93, 220]. An example of the use of numerical quadrature to
evaluate $\Gamma(x)$ is [32, p. 188]. This is an interesting alternative to the method based
on Stirling's asymptotic expansion (§4.5).
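As an illustration of one of the quadrature rules named above, here is a minimal sketch of the tanh-sinh rule in ordinary double precision. The function name `tanh_sinh` and the parameters `h` and `t_max` are our own choices for this example and are not taken from any of the cited references; a high-precision version would replace the floating-point operations by arbitrary-precision ones.

```python
import math

def tanh_sinh(f, a, b, h=1/16, t_max=4.0):
    """Approximate the integral of f over [a, b] by the tanh-sinh rule.

    The substitution x = tanh((pi/2) sinh t) maps the real line to (-1, 1);
    the trapezoidal rule with step h is then applied in the variable t.
    """
    c = (a + b) / 2          # midpoint of the interval
    r = (b - a) / 2          # half-width of the interval
    n = int(t_max / h)       # number of nodes on each side of t = 0
    total = 0.0
    for k in range(-n, n + 1):
        t = k * h
        u = (math.pi / 2) * math.sinh(t)
        x = math.tanh(u)                                      # node in (-1, 1)
        w = (math.pi / 2) * math.cosh(t) / math.cosh(u) ** 2  # weight dx/dt
        total += w * f(c + r * x)
    return r * h * total

if __name__ == "__main__":
    # The integral of 4/(1+x^2) over [0, 1] equals pi.
    print(tanh_sinh(lambda x: 4 / (1 + x * x), 0.0, 1.0))
```

The weights decay doubly exponentially in $t$, which is why truncating the sum at a modest `t_max` loses essentially nothing for smooth integrands.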

We have not discussed the computation of specific mathematical constants such
as $\pi$, $\gamma$ (Euler's constant), $\zeta(3)$, etc. The constant $\pi$ can be evaluated using
$\pi = 4\arctan(1)$ and a fast arctan computation (§4.9.2), or by the Gauss-Legendre
algorithm (also known as the Brent-Salamin algorithm), see H6 l [T93]. This
asymptotically fast algorithm is based on the arithmetic-geometric mean and
Legendre's relation (4.72). A
recent record computation by Bellard [H] used a rapidly-converging series for $1/\pi$
by the Chudnovsky brothers [64], combined with binary splitting. Its complexity
is $O(M(n)\log^2 n)$ (theoretically worse than Gauss-Legendre's $O(M(n)\log n)$, but
with a small constant factor). There are several popular books on $\pi$: we
mention Arndt and Haenel [7]. A more advanced book is the one by the Borwein
brothers [36].
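The Gauss-Legendre (Brent-Salamin) iteration mentioned above can be sketched in a few lines. This is only an illustration using Python's standard `decimal` module, not the implementation of any package discussed in this book; the function name `pi_gauss_legendre` and the choice of ten guard digits are assumptions made for the example.

```python
from decimal import Decimal, localcontext

def pi_gauss_legendre(digits):
    """Approximate pi to about `digits` decimal digits by the AGM iteration."""
    with localcontext() as ctx:
        ctx.prec = digits + 10            # working precision with guard digits
        a = Decimal(1)
        b = Decimal(1) / Decimal(2).sqrt()
        t = Decimal(1) / 4
        p = Decimal(1)
        # The number of correct digits roughly doubles at each step, so about
        # log2(digits) iterations are enough.
        for _ in range(max(1, digits.bit_length())):
            a_next = (a + b) / 2
            b = (a * b).sqrt()            # uses the old value of a
            t -= p * (a - a_next) ** 2
            a = a_next
            p *= 2
        return (a + b) ** 2 / (4 * t)

if __name__ == "__main__":
    print(str(pi_gauss_legendre(50))[:52])
```

Each iteration costs one square root and a few multiplications at full precision, which is consistent with the $O(M(n)\log n)$ bound quoted above.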

For a clever implementation of binary splitting and its application to the fast
computation of constants such as $\pi$ and $\zeta(3)$ (and more generally constants
defined by hypergeometric series) see Cheng, Hanrot, Thomé, Zima and
Zimmermann [62, 63].
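To make the idea concrete, the following sketch applies plain binary splitting to the Chudnovsky series for $1/\pi$ using Python integers. It is a textbook-style illustration, not the optimised method of the cited papers; the names `_split` and `pi_chudnovsky`, the guard of ten digits, and the estimate of about 14 digits per term are assumptions of this example.

```python
from math import isqrt

C = 640320
C3_OVER_24 = C ** 3 // 24   # exact, since 24 divides 640320^3

def _split(a, b):
    """Return integers (P, Q, T) summarising the series terms a <= k < b."""
    if b - a == 1:
        if a == 0:
            p = q = 1
        else:
            p = (6 * a - 5) * (2 * a - 1) * (6 * a - 1)
            q = a ** 3 * C3_OVER_24
        t = p * (13591409 + 545140134 * a)
        if a & 1:                      # the series alternates in sign
            t = -t
        return p, q, t
    m = (a + b) // 2
    p1, q1, t1 = _split(a, m)          # left half
    p2, q2, t2 = _split(m, b)          # right half
    # Combine the halves with a few large multiplications.
    return p1 * p2, q1 * q2, q2 * t1 + p1 * t2

def pi_chudnovsky(digits):
    """Return floor(pi * 10**digits) as an integer (sketch with guard digits)."""
    guard = 10
    scale = 10 ** (digits + guard)
    terms = (digits + guard) // 14 + 2   # each term adds about 14 digits
    _, q, t = _split(0, terms)
    sqrt_10005 = isqrt(10005 * scale * scale)
    return (426880 * sqrt_10005 * q) // t // 10 ** guard

if __name__ == "__main__":
    print(pi_chudnovsky(50))
```

The point of the recursion is that most of the work is done in a few multiplications of large integers near the root of the tree, where fast multiplication pays off, rather than in many operations on full-precision floating-point numbers.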

The computation of $\gamma$ and its continued fraction is of interest because it is not
known whether $\gamma$ is rational (though this is unlikely). The best algorithm for
computing $\gamma$ appears to be the "Bessel function" algorithm of Brent and McMillan [53],
as modified by Papanikolaou and later Gourdon [106] to incorporate binary
splitting. A very useful source of information on the evaluation of constants (including
$\pi$, $e$, $\gamma$, $\ln 2$, $\zeta(3)$) and certain functions (including $\Gamma(z)$ and $\zeta(s)$) is Gourdon and
Sebah's web site [106].

A nice book on accurate numerical computations for a diverse set of "SIAM 
100-Digit Challenge" problems is Bornemann, Laurie, Wagon and Waldvogel [32]. 
In particular, Appendix B of this book considers how to solve the problems to
10,000-decimal-digit accuracy (and succeeds in all cases but one).



Chapter 5 

Implementations and Pointers 



Here we present a non-exhaustive list of software packages 
that (in most cases) the authors have tried, together with 
some other useful pointers. Of course, we cannot accept any
responsibility for bugs/errors/omissions in any of the software 
or documentation mentioned here — caveat emptor! 

Websites change. If any of the websites mentioned here disappear 
in the future, you may be able to find the new site using a search 
engine with appropriate keywords. 



5.1 Software Tools 



5.1.1 CLN 



CLN (Class Library for Numbers, http://www.ginac.de/CLN/) is a library
for efficient computations with all kinds of numbers in arbitrary precision.
It was written by Bruno Haible, and is currently maintained by Richard
Kreckel. It is written in C++ and distributed under the GNU General Public
License (GPL). CLN provides some elementary and special functions, and
fast arithmetic on large numbers; in particular, it implements Schönhage-Strassen
multiplication and the binary splitting algorithm [108]. CLN can
be configured to use GMP low-level mpn routines, which improves its
performance.



5.1.2 GNU MP (GMP)

The GNU MP library is the main reference for arbitrary-precision
arithmetic. It has been developed since 1991 by Torbjörn Granlund and several
other contributors. GNU MP (GMP for short) implements several of the
algorithms described in this book. In particular, we recommend reading the
"Algorithms" chapter of the GMP reference manual [104]. GMP is written
in C, is released under the GNU Lesser General Public License (LGPL), and
is available from http://gmplib.org/.

GMP's mpz class implements arbitrary-precision integers (corresponding
to Chapter 1), while the mpf class implements arbitrary-precision
floating-point numbers (corresponding to Chapter 3).* The performance of GMP
comes mostly from its low-level mpn class, which is well designed and highly
optimized in assembly code for many architectures.

As of version 5.0.0, mpz implements different multiplication algorithms
(schoolbook, Karatsuba, Toom-Cook 3-way, 4-way, 6-way, 8-way, and FFT
using Schönhage-Strassen's algorithm). Its division routine implements
Algorithm RecursiveDivRem (§1.4.3) in the middle range, and beyond that
Newton's method, with complexity $O(M(n))$. The same complexity holds for its
square root, which implements Algorithm SqrtRem, since it relies on division.
The Newton division first precomputes a reciprocal to precision $n/2$, and then
performs two steps of Barrett reduction to precision $n/2$: this is an integer
variant of Algorithm Divide. GMP also implements unbalanced multiplication,
with Toom-Cook (3,2), (4,3), (5,3), (4,2), or (6,3) [31]. Function
mpn_ni_invertappr, which is not in the public interface, implements
Algorithm ApproximateReciprocal (§3.4.1). GMP 5.0.0 does not implement
elementary or special functions (Chapter 4), nor does it provide modular
arithmetic with an invariant divisor in its public interface (Chapter 2).
However, it contains a preliminary interface for Montgomery's REDC algorithm.
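To illustrate what "modular arithmetic with an invariant divisor" means in practice, here is a simplified sketch of Barrett reduction with a precomputed reciprocal, written with Python integers. It is not GMP's code and does not use GMP's interface; the names `barrett_setup` and `barrett_reduce` are hypothetical, and this variant multiplies by the full reciprocal rather than truncating operands as an optimised implementation would.

```python
def barrett_setup(n):
    """Precompute k = bit length of n and mu = floor(4**k / n) for a fixed n > 0."""
    if n <= 0:
        raise ValueError("modulus must be positive")
    k = n.bit_length()
    mu = (1 << (2 * k)) // n      # scaled reciprocal of the invariant divisor
    return k, mu

def barrett_reduce(x, n, k, mu):
    """Return x mod n for 0 <= x < 4**k without a division by n."""
    q = (x * mu) >> (2 * k)       # estimate of floor(x / n), never too large
    r = x - q * n
    while r >= n:                 # a small number of corrections suffices
        r -= n
    return r

if __name__ == "__main__":
    n = 2 ** 127 - 1
    k, mu = barrett_setup(n)
    x = 12345678901234567890123456789 ** 2
    print(barrett_reduce(x, n, k, mu) == x % n)
```

The one division is paid once in `barrett_setup`; every later reduction by the same modulus costs two multiplications, a shift and a cheap correction.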

MPIR is a "fork" of GMP, with a different license, and various other
differences that make some functions more efficient with GMP, and some with
MPIR; also, the difficulty of compiling under Microsoft operating systems
may vary between the forks. Of course, the developers of GMP and MPIR
are continually improving their code, so the situation is dynamic. For more
on MPIR, see http://www.mpir.org/.

* However, the authors of GMP recommend using MPFR (see §5.1.4) for new projects.



5.1.3 MPFQ

MPFQ is a software library developed by Pierrick Gaudry and Emmanuel
Thomé for manipulation of finite fields. What makes MPFQ different from
other modular arithmetic libraries is that the target finite field is given at
compile time, thus more specific optimizations can be done. The two main
targets of MPFQ are the Galois fields $\mathbb{F}_{2^n}$ and $\mathbb{F}_p$ with $p$ prime. MPFQ is
available from http://www.mpfq.org/, and is distributed under the GNU
Lesser General Public License (LGPL).



5.1.4 MPFR 

MPFR is a multiple-precision binary floating-point library, written in C,
based on the GNU MP library, and distributed under the GNU Lesser General
Public License (LGPL). It extends the main ideas of the IEEE 754 standard
to arbitrary-precision arithmetic, by providing correct rounding and
exceptions. MPFR implements the algorithms of Chapter 3 and most of
those of Chapter 4, including all mathematical functions defined by the ISO
C99 standard. These strong semantics are in most cases achieved with no
significant slowdown compared to other arbitrary-precision tools. For details
of the MPFR library, see http://www.mpfr.org and the paper [96].



5.1.5 Other Multiple-Precision Packages 

Without attempting to be exhaustive, we briefly mention some of MPFR's 
predecessors, competitors, and extensions. 

1. ARPREC is a package for multiple-precision floating-point arithmetic,
written by David Bailey et al. in C++/Fortran. The distribution
includes The Experimental Mathematician's Toolkit, which is an
interactive high-precision arithmetic computing environment. ARPREC is
available from http://crd.lbl.gov/~dhbailey/mpdist/.

2. MP [17] is a package for multiple-precision floating-point arithmetic
and elementary and special function evaluation, written in Fortran.
MP permits any small base $\beta$ (subject to restrictions imposed by the
word-size), and implements several rounding modes, though correct
rounding-to-nearest is not guaranteed in all cases. MP is now obsolete,
and we recommend the use of a more modern package such as MPFR.
However, much of Chapter 4 was inspired by MP, and some of the
algorithms implemented in MP are not yet available in later packages,
so the source code and documentation may be of interest: see
http://rpbrent.com/pub/pub043.html.

3. MPC (http://www.multiprecision.org/) is a C library for arithmetic
using complex numbers with arbitrarily high precision and correct
rounding, written by Andreas Enge, Philippe Théveny and Paul
Zimmermann [90]. MPC is built on and follows the same principles as
MPFR.

4. MPFI is a package for arbitrary-precision floating-point interval
arithmetic, based on MPFR. It can be useful to get rigorous error bounds
using interval arithmetic. See http://mpfi.gforge.inria.fr/, and
also §5.3.

5. Several other interesting/useful packages are listed under "Other
Related Free Software" at the MPFR website http://www.mpfr.org/.
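The remark above that interval arithmetic gives rigorous error bounds can be illustrated with a toy example in double precision. This sketch is not MPFI and does not use its interface: the class `Interval` and its method `contains` are hypothetical names, and outward rounding is imitated by stepping one floating-point number outwards with `math.nextafter` (Python 3.9 or later) instead of using directed rounding modes.

```python
import math
from fractions import Fraction

class Interval:
    """A closed interval [lo, hi] with double-precision endpoints."""

    def __init__(self, lo, hi=None):
        self.lo = float(lo)
        self.hi = float(lo if hi is None else hi)

    def __add__(self, other):
        # Round to nearest, then widen by one float on each side so that
        # the exact sum is certainly enclosed.
        return Interval(math.nextafter(self.lo + other.lo, -math.inf),
                        math.nextafter(self.hi + other.hi, math.inf))

    def __mul__(self, other):
        products = [self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi]
        return Interval(math.nextafter(min(products), -math.inf),
                        math.nextafter(max(products), math.inf))

    def contains(self, value):
        """Exact membership test using rational arithmetic."""
        return Fraction(self.lo) <= Fraction(value) <= Fraction(self.hi)

if __name__ == "__main__":
    x, y = Interval(0.1), Interval(0.2)
    s = x + y
    print(s.lo, s.hi, s.contains(Fraction(0.1) + Fraction(0.2)))
```

The enclosure is slightly wider than necessary; a library with true directed rounding, such as one built on MPFR, can return the tightest representable interval.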



5.1.6 Computational Algebra Packages 

There are several general-purpose computational algebra packages that in- 
corporate high-precision or arbitrary-precision arithmetic. These include 
Magma, Mathematica, Maple and Sage. Of these, Sage is free and
open-source; the others are either commercial or semi-commercial and not
open-source. The authors of this book have often used Magma, Maple and Sage
for prototyping and testing algorithms, since it is usually faster to develop an 
algorithm in a high-level language (at least if one is familiar with it) than in 
a low-level language like C, where one has to worry about many details. Of 
course, if speed of execution is a concern, it may be worthwhile to translate 
the high-level code into a low-level language, but the high-level code will be 
useful for debugging the low-level code. 

1. Magma (http://magma.maths.usyd.edu.au/magma/) was developed
and is supported by John Cannon's group at the University of Sydney.
Its predecessor was Cayley, a package designed primarily for
computational group theory. However, Magma is a general-purpose
algebra package with logical syntax and clear semantics. It includes
arbitrary-precision arithmetic based on GMP, MPFR and MPC.
Although Magma is not open-source, it has excellent online documentation.

2. Maple (http://www.maplesoft.com) is a commercial package
originally developed at the University of Waterloo, now by Waterloo Maple,
Inc. It uses GMP for its integer arithmetic (though not necessarily the
latest version of GMP, so in some cases calling GMP directly may be
significantly faster). Unlike most of the other software mentioned in
this chapter, Maple uses radix 10 for its floating-point arithmetic.

3. Mathematica is a commercial package produced by Stephen Wolfram's 
company Wolfram Research, Inc. In the past, public documentation 
on the algorithms used internally by Mathematica was poor. However, 
this situation may be improving. Mathematica now appears to use 
GMP for its basic arithmetic. For information about Mathematica, see 
http://www.wolfram.com/products/mathematica/.

4. NTL (http://www.shoup.net/ntl/) is a C++ library providing data
structures and algorithms for manipulating arbitrary-length integers, 
as well as vectors, matrices, and polynomials over the integers and 
over finite fields. For example, it is very efficient for operations on 
polynomials over the finite field F2 (that is, GF(2)). NTL was written 
by and is maintained by Victor Shoup. 

5. PARI/GP (http://pari.math.u-bordeaux.fr/) is a computer algebra
system designed for fast computations in number theory, but also
able to handle matrices, polynomials, power series, algebraic numbers,
etc. PARI is implemented as a C library, and GP is the scripting
language for an interactive shell giving access to the PARI functions.
Overall, PARI is a small and efficient package. It was originally
developed in 1987 by Christian Batut, Dominique Bernardi, Henri Cohen
and Michel Olivier at Université Bordeaux I, and is now maintained by
Karim Belabas and many volunteers.

6. Sage (http://www.sagemath.org/) is a free, open-source mathematical
software system. It combines the power of many existing open-source
packages with a common Python-based interface. According
to the Sage website, its mission is "Creating a viable free open-source
alternative to Magma, Maple, Mathematica and Matlab". Sage was
started by William Stein and is developed by a large team of volunteers.
It uses MPIR, MPFR, MPC, MPFI, PARI, NTL, etc. Thus, it
is a large system, with many capabilities, but occupying a lot of space
and taking a long time to compile.



5.2 Mailing Lists 

5.2.1 The BNIS Mailing List 

The BNIS mailing list was created by Dan Bernstein for "Anything of
interest to implementors of large-integer arithmetic packages". It has low traffic
(a few messages per year only). See http://cr.yp.to/lists.html to
subscribe. An archive of this list is available at
http://www.nabble.com/cr.yp.to---bnis-f846.html.



5.2.2 The GMP Lists 

There are four mailing lists associated with GMP: gmp-bugs for bug reports; 
gmp-announce for important announcements about GMP, in particular new 
releases; gmp-discuss for general discussions about GMP; gmp-devel for 
technical discussions between GMP developers. We recommend subscription 
to gmp-announce (very low traffic), to gmp-discuss (medium to high traffic),
and to gmp-devel only if you are interested in the internals of GMP.
Information about these lists (including archives and how to subscribe) is
available from http://gmplib.org/mailman/listinfo/.



5.2.3 The MPFR List 



There is only one mailing list for the MPFR library. See http://www.mpfr.org/
to subscribe or search through the list archives.



5.3 On-line Documents 

The NIST Digital Library of Mathematical Functions (DLMF) is an
ambitious project to completely rewrite Abramowitz and Stegun's classic
Handbook of Mathematical Functions [1]. It will be published in book form by
Cambridge University Press as well as online at http://dlmf.nist.gov/.
As of February 2010 the project is incomplete, but still very useful. For
example, it provides an extensive online bibliography with many hyperlinks at
http://dlmf.nist.gov/bib/.



The Wolfram Functions Site http://functions.wolfram.com/ contains
a lot of information about mathematical functions (definition, specific values,
general characteristics, representations as series, limits, integrals, continued 
fractions, differential equations, transformations, and so on). 

The Encyclopedia of Special Functions (ESF) is another nice web site, 
whose originality is that all formulae are automatically generated from very 
few data that uniquely define the corresponding function in a general class 
[164]. This encyclopedia is currently being reimplemented in the Dynamic
Dictionary of Mathematical Functions (DDMF); both are available from
http://algo.inria.fr/online.html.



A large amount of information about interval arithmetic (introduction,
software, languages, books, courses, applications) can be found on the
Interval Computations page http://www.cs.utep.edu/interval-comp/.

Mike Cowlishaw maintains an extensive bibliography of conversion to and
from decimal arithmetic at http://speleotrove.com/decimal/.

The Inverse Symbolic Calculator (ISC) by Simon Plouffe (building on earlier
work by the Borwein brothers) is useful if you want to identify an unknown
real constant such as 1.414213...; it is at http://oldweb.cecm.sfu.ca/project…

Finally, an extremely useful site for all kinds of integer/rational sequences
is Neil Sloane's Online Encyclopaedia of Integer Sequences (OEIS) at
http://www.research.att.com/~njas/sequences/.
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for exp, 11861 

for subtraction, 11021 

for summation, 11491 

negative, 11761 

Haenel, Christoph, 11991 
Haible, Bruno, EOU 

HAKMEM, nm 

HalfBezout, |33] 
HalfBinaryGcd, EZl E] 
HalfGcd, m 

Hanek, Robert N.. [TIM[T^ 
Hankerson, Barrel Richard, [84] 
Hanrot , Guillaume, [ill [43] [H [48] [1311 
[T991 

harmonic number, lxiii[ 11881 

Hars, Laszlo, [83] 

Harvey, David, [13 [13 [IHl \IM M 

[T931 [T961 [T97I 
Hasan, Mohammed Anwarul, [IS] 
Hasenplaugh, William, [83] 
Hastad, Johan Torkel, [IS] 
Hellman, Martin Edward, [74] 
Hennessy, John LeRoy, 11301 



Hensel 

division, [26H271 [491 MMi [22] 

lifting, [2S1 [2a [33 [la [sa [zn 

Hensel, Kurt Wilhelm Sebastian, [53] 
Heron of Alexandria, 11401 
Higham, Nicholas John, [HH MM 
Hille, Einar Carl, [TMI 
Hoeven, see van der Hoeven 
holonomic function, [1511 [HB IM [I98] 
Hopcroft, John Edward, [13 [HI 
Horner's rule, [UHl [152 fT55] 

forward, 11861 
Horner, William George, 11481 
Householder, Alston Scott, 11941 
Hull, Thomas Edward, 11311 
Hurwitz zeta-function, 11981 
Hurwitz, Adolf, Um 
hypergeometric function, I151[ I172[ 11811 

idempotent conversion, 11331 
IEEE 754 standard, ES\ 

extension of, 12031 
IEEE 854 standard, UWl 

infinity, M 
INRIA, EH 
integer 

notation for, Ixvil 
integer division 

notation for, Ixivl 
integer sequences, 12071 
interval arithmetic, [2M [2U7] 
inversion 

batch, M 

modular, [351 [ZDHH M 
lordache, Cristina S.. 11311 
ISC. [2071 

Ispiryan, Karen R.,[13] 
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Jacobi symbol, HTJ [50] 

notation for, Ixivl 

sub quadratic algorithm, HTJ [50] 
Jacobi, Carl Gustav Jacob, HT] 
Jebelean, Tudor, [49] 
Johnson, Jeremy Russell, 11311 
Jones, Wilham B..IT95] 
Jullien, Graham A.. [84] 

Kahan, William Morton, 
Kaihara, Marcelo Emilio, [83] 
Kanada, Yasumasa, I176[ 11961 
Kaneko, Masanobu, 11961 
Karatsuba's algorithm, \M [H 
[181 [671 [T77I 

in-place version, [H] 

threshold for, [H] 
Karatsuba, Anatolii Alekseevich, W5\ 

mmiim 

Karatsuba, Ekatherina Anatolievna, 

mi 

Karp, Alan Hersh, M M [ffl MM 
Karp-Markstein trick, [Ml \M MM 
Kayal, Neeraj, Ml 
Khachatrian, Gurgen H., [43] [48] 
Khinchin, Aleksandr Yakovlevich, Ml 

mi 

Kidder, Jeffrey Nelson, mi 

Knuth, Donald Ervin, [iH [48] [49] [TSTl 

[mi[T^ 
Koornwinder, Tom Hendrik, 11981 
Krandick, Werner, Ml [TM] 
Kreckel, Richard Bernd, 12011 
Kronecker, Leopold, Ml 
Kronecker-Schonhage trick, [3l Ml Ml 

Kulisch, Ulrich Walter Heinz, 11331 
Kung, Hsiang Tsung, Ml [TM[ 



Kuregian, Melsik K.,M1 
Kuz'min, Rodion Osievich, [49] 

Lagrange interpolation, [7] [80] 

Lagrange, Joseph Louis, [7] 

Landen transformations, 11771 11961 

Landen, John, 11771 

Lang, Tomas, 11311 

Laurie, Dirk, [1981 [199] 

lazy algorithm, [2] [IS 

leading zero detection, 11021 

Lecerf, Gregoire, 11311 

Lefevre, Vincent, Ml mi mi 

Legendre, Adrien-Marie, I171[ 11991 

Lehmer, Derrick Henry, [321 Ml 11981 

Lehmer-Gabcke sequence, 11981 

Lenstra, Arjen Klaas, Ml 

Lenstra, Hendrik Willem, Jr.. [49] 

level-index arithmetic, 11301 

Ig, see logarithm 

LGPL, [2021 [203] 

Lickteig, Thomas Michael, 11331 

lists versus arrays, [91] 

little o notation, |xv] 

In, see logarithm 

Loan, see Van Loan 

log, see logarithm 

loglp, see logarithm 

Logan, Benjamin Franklin "Tex", Jr., 
[T96] 

logarithm 

addition formula, 11441 
computation via AGM, 11731 
lg(a;), ln(x), logfo:). Ixvl 

loglp, mi mi 

notations for, |xv] 
Sasaki-Kanada algorithm, 11751 
logical operations, Ixivl 
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LSB, El EH EH [53] 
Luschny, Peter, [46] 
Lyness, James N., 11981 

machine precision, Ixivl 
Maeder, Roman Erich, HH 
Magaud, Nicolas, |49] 
Magma, HM 
maihng hsts, 12061 
mantissa, see significand 
Maple, UnHl EQS] 

Markstein, Peter, E2 SH EM HH 

Martin, David W.,EM 

MasPar, M 

Massey, James Lee, US] 

Mathematica, 12051 

Mathematics Genealogy Project, Ixll 

matrix multiplication, |45l 11331 

matrix notation, |xv] 

Matula, David William, [IH] 

Maze, Gerard, H9] 

MCA,[M] 

McLaughlin's algorithm, [621 [631 [68] - 

[701 [83] 
polynomial version, [83] 
McLaughlin, Philip Burtis, Jr., [H] 

[M1[M] 

McMillan, Edwin Mattison, [T99] 

Menezes, Alfred John, [SH 

Menissier-Morain, Valerie, 11301 

Mezzarobba, Marc, [xT] 11931 

Microsoft, [202] 

middle product, [241 [451 [T09] 

Mihailescu, Preda V.. [55] 

Mikami, Yoshio, [H] 

Miller's algorithm, [W] [IHUl EM 

Miller, Jeffrey Charles Percv. [T671 [T95] 

Miller, Wilham C.,M 



mod notation, Ixivl 
modular 

addition, [54] 

division, [70] 

exponentiation, [HHZSl [HI 
base 2^ [TS 

inversion, EH [TUHZll [82] 

multiplication, [631470] 

splitting, 11541 

subtraction, [SH 
modular arithmetic 

notation for, Ixivl 

special moduli, [TOl [TH El 
modular representation, [79] 

comparison problem, [81] 

conversion to/ from. [7^ 

redundant, [HH 

sign detection problem, [81] 
Moenck, Robert Thomas, [lH El 
Moler, Cleve Barry, EM 
M511er, Niels, [13 [S] [IHl El 
Montgomery's algorithm, [B5] 
Montgomery's form, [52] [65] 
Montgomery multiplication, [65H68] 

subquadratic, [67[ 
Montgomery reduction, [27] [SSI 
Montgomery, Peter Lawrence, [IS] [521 

E31E1 

Montgomery-Svoboda algorithm, [53] 

[66H681 [821 E3] 
Mori, Masatake, EM 
MF. EMEmW 
MPC. [2041 [205] 
MPFI, EOl 

MPFQ, Eoa 

MPFR, E031 E05] 
MPIR, [202] 

MSB, E31 El EH ED E3] 
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Mulders, Thorn, M [1281 [EH 
Muller, Jean-Michel, EE fT3TtiT33l ITOl 
multiplication 

by a constant, 

carry bit, |44] 

complex, 11771 

FFT range, 1 

Fiirer's algorithm, | 

Karatsuba's algorithm, 11771 

modular, 

of integers, 

of large integers, 

Schonhage-Strassen, 

schoolbook, |5] 

short product, 11031 

time for, M(n), Ixivl 

unbalanced, 
complexity of, | 

via complex FFT, 11071 
multiplication chain, [71] 

weighted, 1551 
Munro, (James) Ian, [8^ 



NaN,[88] 

quiet, 

signalling, 
nbits, |xv] 

nearest integer function [x] , Ixvl 
Neumann, Carl Gottfried, 11661 
Newton's method, [231 [23 [2H1 [Ml [ZH 
[ml [T2il [T35HT431 [T94I 

for functional inverse, 11411 11511 

for inverse roots, 11381 

for power series, 11401 

for reciprocal, 11381 

for reciprocal square root. 11391 

higher order variants, 11421 

Karp-Marstein trick, 11941 



p-adic (Hensel lifting) 
Newton, Isaac, [231 Mi [IIIl [133 
Nicely, Thomas R.. 11391 
NIST, M 

NIST Digital Library, 
normalized divisor, 



Not a Number (NaN), [88] 
Nowka, Kevin John, 11311 
NTL,[2nS] 

numerical differentiation, 11981 

numerical instability 
in summation, 11501 
recurrence relations, 11681 

numerical quadrature, see quadrature 

Nussbaumer, Henri Jean, [84] 



odd zet a- function, 11701 
odd-even scheme, ^ [48] [IMl [US] 
Odlyzko, Andrew Michael, [T98] 
Odlyzko-Schonhage algorithm, 11981 
OEIS,[207] 

off-line algorithm, [21 W8\ 

Olivier, Michel, [205] 

Olver, Frank William John, [1201 MSI 

Omega notation Q, |xv] 

on-line algorithm, [48] 

Oorschot, see van Oorschot 

ord, Ixvl 

Osborn, Judy-anne Heather, [xi] 



Paar, Christof, 
p-adic, [53] 

Pan, Victor Yakovlevich, 11311 
Papanikolaou, Thomas, 11991 

pARi/GP, mm 

Patashnik, Oren, 11961 
Paterson, Michael Stewart, 11951 
Patterson, David Andrew, 11301 
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Payne and Hanek 

argument reduction, 11091 11321 
Payne, Mary H., [M [131 
Pentium bug, ^M, [Ml 
Percival, Colin Andrew, EM [M [M 

m\ 

Petermann, Yves- Francois Sapphorain, 
[T98] 

Petersen, Vigdis Brevik, 11951 
phi function </>, Ixivl 

Brent-Salamin al gor it hm . \17'd\ [TIJB] 

Chudnovsky series, 11991 

Gauss-Legendre algorithm, 11731 

record computation, 11991 
Pila, Jonathan S.. 
Plouffe, Simon, [203 
Pollard, John Michael, [831 M 
polylogarithm, 11981 
polynomial evaluation, 11531 
Pomerance, Carl 



power 

computation of, [TH 
detection of 



power series 

argument reduction, 11521 
assumptions re coefficients, 11511 
backward summation, I146[ 11491 
direct evaluation, 11521 
forward summation, I146[ 11491 
radius of convergence, 11511 

precision, Ixivl 

local/global, [911 
machine, 11491 

operand/operation, [911 11311 
reduced, 11761 
working, [HHl IHHI 
Priest, Douglas M., Ml [SH 



product tree, [751 
pseudo-Mersenne prime, [701 [HH 
PV /, see Cauchy principal value 
Python, [2051 

quadrature 

Clenshaw-Curtis, 11981 

contour integration, 11841 

Gaussian, 11981 

numerical, 11981 

Romberg, 11981 

tanh-sinh, 11981 
Querela, Michel, [H [13 imi 
Quisquater, Jean- Jacques 
quotient selection, [181 [El 



Rader, Charles M., 
radix, MM 

choice of, [871 

mixed, M 

radix ten, \TM 
rational reconstruction, [iQl 
reciprocal square root, I121| 11391 
rectangular series splitting, I153til56[ 
[1951 

recurrence relations, 11651 

REDc, [03 m 

redundant representation 

for error detection/correction, [HTl 
for exponentiation, [7S1 
for modular addition, [021 

Reinsch, Christian, 11951 

relaxed algorithm, [2l W8\ 

relaxed multiplication, [H21 

remainder tree, HH [751 

Remy, Jean-Luc, 11981 

residue class representation, [51] 

residue number system, [031 [701 1331 
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Reyna, Juan Arias de, 11981 
Richardson extrapolation, 11981 
Richardson, Lewis Fry, 11981 
Riemann Hypothesis 

computational verification, 11981 
Riemann zeta-function, 11591 11991 

at equally spaced points, 11981 

at even integers, 11701 

Bernoulli numbers, 11711 

Borwein's algorithm, 11981 

error analysis, 11981 

Euler-Maclaurin expansion, 11591 
(1981 

odd zeta-function, 11701 
Odlyzko-Schonhage algorithm, 11981 
Riemann-Siegel formula, 11981 
Riemann, Georg Friedrich Bernhard, 

m\ 

Rivest, Ronald Linn, [71] 
RNS, see residue number system 
Robertson, James Evans, 11941 
Roche, Daniel Steven, |44] 
Roegel, Denis, |xi] 
Romberg quadrature, 11981 
Romberg, Werner, 11981 
root 

k-th, m 

Goldschmidt's iteration, 11321 
inverse, 11381 
principal, |55] 
square, [2^3 MS\ 
complex, [ISa [Ml 
paper and pencil, [27] 
wrap-around trick, 11241 
Rosser, John Barkley, 11981 
rounding 

away from zero 
boundary 



correct, [H21 [UHl 

double, M 

mode, [91 [130] 

notation for, Ixivl 

probabilistic, [91] 

round bit, [96] 

sticky bit, [96] [M [T30] 

stochastic, [91] 

strategies for, [98] 

to nearest, [Ml [MM 
balanced ternary, 11281 

to odd, [T28] 

towards zero, [91 [128] 

Von Neumann, 11281 
rounding mode o, [U21 - [99] 
Roy, Ranjan,[l92[l95] 
RSA cryptosystem, [74] 
runs of zeros/ones, 11311 
Ryde, Kevin, [H 



Sage, [205] 

Salamin, Eugene, I196[ 11991 

Salvy, Bruno, 11931 

Sasaki, Tateaki, [T76l [T96] 

Saxena, Nitin, [IH] 

Schmid, Wolfgang Alexander, [xi] 

Schmookler, Martin S.. 11311 

Sch5nhage, Arnold, ^ MWi [1321 

[T861 [T951 [T98] 
Schonhage-Strassen algorithm. [54 ] [60 ] 

[701 [841 [TT51 [ml [20T] 
Schost, Eric, [m] UM 
Schroeppel, Richard Crabtree, 11961 
Sebah, Pascal, 11991 
secant numbers, 11701 11921 
Sedjelmaci, Sidi Mohamed, [xi] 
Sedoglavic, Alexandre, [16] 
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Stehle, Damien, gTl M 
Stein, Josef, |49] 
Stein, William Arthur, [2061 
Sterbenz's theorem, 11021 11311 
Sterbenz, Pat Holmes, M [IM] 
sticky bit. Ml M 
Shokrollahi, Mohammad Amin. H5 | ll33I Stirling numbers, I189[ 11961 



segmentation, see Kronecker- 

Schonhage trick 
Sergeev, Igor S.. 11971 
Shallit, Jeffrey Outlaw, 
Shamir, Adi, [H] 
Shand, Mark Alexander, 



short division, 11311 

short product. Ml fT03HT06l fT3T] 

Shoup, Victor John, |46l [205] 

Sieveking, Malte, [T94l 

sign, Ev] 

sign-magnitude. El |52l [HI |9i 
significand, [851 l90l 
sin(x),[T4l 
sinh(x),[I17| 

sliding window algorithm, [771 
Sloane, Neil James Alexander, 12071 
Smith's method, see rectangular 

series splitting 
Smith, David Michael, [195] 
software tools, 12011 
Sorenson, Jonathan Paul, [33] [49] [831 



special function, [T35HT991 [206ti2071 
special moduli, [TD] [ZH [HI 
splitting 

classical, 11541 

modular, 11541 
square root, see root 
squaring, [T21 [IS] 

complex, 11771 
SRT division, [1361 [Ell [T9l 

Staudt, Karl Georg Christian von, [1691 Svoboda's algorithm. [TS] 

[T96] [531 [661 [681 [83] 

Steel, Allan, [48] Svoboda, Antonin, [49] [53] 

Steele, Guy Lewis, Jr.. 11321 Swartzlander, Earl E., Jr. 

Stegun, Irene Anne, I194[ I195[ 12061 Sweeney, Dura Warren, 11941 



Stirling's approximation 
convergent form, 11891 
for \nT{iy),m 
for \nT{x),M2\ 
for \nT{ziw 

for n! or T{z),EM (UHl [1601 

[T92l[T98l 
with error bounds, 11581 
Stirling, James, 11451 
Stockmeyer, Larry Joseph, 11951 
Stoer, Josef, [T98] 
Strassen's algorithm, [40] 
Strassen, Volker, [40] [T33] 
strings 

concatenation, Ixvil WT\ 
subnormal numbers, [88] 

smallest, Ixivl 
substitution, see Kronecker- 

Schonhage trick 
subtraction, [2] M 
guard digits, 11021 
leading zero detection, 11021 
modular, [5^ 
summation 

backward, [HQ [US 
forward, [1461 



[TMl 
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Tables 

Table 3.1, [96] 

Table 3.2, [TOTl 

Table 3.3, UM 

Table 4.1, [TTH] 
Takagi, Naofumi, [551 
Takahasi, Hidetosi, 11981 

tan(x). nmnm\ 

tangent numbers, lxiii[ I169[ I19H 11961 

algorithm for, 11691 

complexity of evaluation, 11921 

space required for, 11911 
Tellegen's principle, 11311 
Temme, Nico M., MM [1981 
tensor rank, SSI 11331 
ternary system, 11281 
theta functions, 11741 
Theta notation 0, Ixvl 
Theveny, Philippe, EQl 
Thome, Emmanuel, EH El 113 [HH 

Tocher, Keith Douglas, 11941 
Toom, Andrei Leonovich, HHl 
Toom-Cook multiplication, El 

m 

time for, [7] 
totient function, Ixivl 
Traub, Joseph Frederick, 11941 
Trefethen, (Lloyd) Nicholas, ESHl (1991 
tripling formula 

for sin, 11441 

for sinh, 11471 

in FFT range, EZl [US] 

Ullman, Jeffrey David
unbalanced multiplication
unit in the last place (ulp)
unrestricted algorithm
    for exp

Vallée, Brigitte
valuation
van der Hoeven, Joris
Van Loan, Charles Francis
van Oorschot, Paul Cornelis
Vandermonde matrix
Vanstone, Scott Alexander
vectors, notation for
Vepstas, Linas
Verdonk, Brigitte
Vetter, Herbert Dieter Ekkehart
Vidunas, Raimundas
Von Neumann, John (Janos Lajos)
Von Staudt-Clausen theorem
von zur Gathen, Joachim
Vuillemin, Jean Etienne

Waadeland, Haakon
Wagon, Stanley (Stan)
Waldvogel, Jörg
Wall, Hubert Stanley
Wang, Paul Shyh-Horng
Watson, George Neville
Weber functions, Y_ν(x)
Weber, Heinrich Friedrich
Weber, Kenneth
Weimerskirch, André
Wezelenburg, Mark
White, Jim
White, Jon L.
Whittaker, Edmund Taylor
Wilkinson, James Hardy
Winograd, Shmuel
Wolfram, Stephen
Wong, Roderick
wrap-around trick

Yap, Chee-Keng

Zanoni, Alberto
zealous algorithm
Zeilberger, Doron
zero, ±0
Ziegler, Joachim
Zima, Eugene
Zimmermann, Marie
Zimmermann, Paul Vincent
Ziv's algorithm
Zuras, Dan



Summary of Complexities 



Integer Arithmetic (n-bit or (m, n)-bit input)

    Addition, Subtraction                   O(n)
    Multiplication                          M(n)
    Unbalanced Multiplication (m ≥ n)       M(m, n) ≤ ⌈m/n⌉ M(n), M((m+n)/2)
    Division                                O(M(n))
    Unbalanced Division (with remainder)    D(m + n, n) = O(M(m, n))
    Square Root                             O(M(n))
    k-th Root (with remainder)              O(M(n))
    GCD, extended GCD, Jacobi symbol        O(M(n) log n)
    Base Conversion                         O(M(n) log n)
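The first bound for unbalanced multiplication, M(m, n) ≤ ⌈m/n⌉ M(n), can be illustrated by cutting the longer operand into n-bit pieces and multiplying each piece by the shorter operand. The following is only a minimal sketch: the function name `unbalanced_mul` and the returned multiplication count are hypothetical, introduced for illustration, and Python's built-in integer product stands in for an n-bit by n-bit multiplication of cost M(n).

```python
def unbalanced_mul(a: int, b: int, n: int) -> tuple[int, int]:
    """Multiply non-negative a by b, where b fits in n bits.

    a is cut into n-bit chunks, least significant first; each chunk is
    multiplied by b and the partial product is shifted into place.
    Returns (product, number of chunk-by-b multiplications).
    """
    if a < 0 or b < 0 or n <= 0:
        raise ValueError("expected a >= 0, b >= 0 and n > 0")
    mask = (1 << n) - 1
    product = 0
    shift = 0
    count = 0
    while a:
        chunk = a & mask                   # next n-bit piece of a
        product += (chunk * b) << shift    # one n-by-n multiplication
        a >>= n
        shift += n
        count += 1
    return product, count


# Hypothetical example: a 101-bit operand times a 32-bit operand.
a = 2**100 + 12345
b = 2**31 + 7
product, count = unbalanced_mul(a, b, 32)
print(product == a * b, count)
```

For an m-bit value of a, the loop runs ⌈m/n⌉ times, which matches the factor in the bound. The additions and shifts cost O(m + n) in total and do not change the order of the estimate.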



Modular Arithmetic (n-bit modulus)

    Addition, Subtraction                           O(n)
    Multiplication                                  M(n)
    Division, Inversion, Conversion to/from RNS     O(M(n) log n)
    Exponentiation (k-bit exponent)                 O(k M(n))
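The O(k M(n)) bound for exponentiation is consistent with left-to-right binary (square-and-multiply) exponentiation, which uses one modular squaring per exponent bit and one extra modular multiplication per nonzero bit. The sketch below is an illustration under that assumption, not necessarily the variant preferred in the text; the name `mod_pow` and the operation counter are hypothetical.

```python
def mod_pow(base: int, exponent: int, modulus: int) -> tuple[int, int]:
    """Compute base**exponent mod modulus by left-to-right binary
    exponentiation.

    Returns (result, number of modular multiplications performed,
    counting squarings as multiplications).
    """
    if exponent < 0 or modulus <= 0:
        raise ValueError("expected exponent >= 0 and modulus > 0")
    base %= modulus
    result = 1 % modulus
    mults = 0
    for bit in bin(exponent)[2:]:              # most significant bit first
        result = result * result % modulus     # squaring for every bit
        mults += 1
        if bit == "1":
            result = result * base % modulus   # extra multiply for a 1 bit
            mults += 1
    return result, mults


# Hypothetical example: the exponent 200 has 8 bits, 3 of them set.
value, mults = mod_pow(3, 200, 1000003)
print(value == pow(3, 200, 1000003), mults)
```

With a k-bit exponent the loop performs at most 2k modular multiplications, each of cost O(M(n)) for an n-bit modulus, which gives the O(k M(n)) total.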



Floating-Point Arithmetic (n-bit input and output)

    Addition, Subtraction                           O(n)
    Multiplication                                  M(n)
    Division                                        O(M(n))
    Square Root, k-th Root                          O(M(n))
    Base Conversion                                 O(M(n) log n)
    Elementary Functions                            O(M(n) log n)
      (in a compact set excluding zeros and poles)



